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Preface 


Recently it has been deemed important that remote sensing data analysis 
technology be developed so that earth resources programs initiated by 
several agencies, both in the United States and abroad, can be utilized 
effectively to assess fully available earth resources. Techniques when 
judiciously used could provide answers to certain important and pertinent 
problems facing mankind. Under the technology of remote sensing the multi- 
channel scanning devices are employed for securing earth information and 
identifying response pattern for each natural (earth) resource as completely 
and reliably as possible. This, however, gives rise to multivariate data 
and hence, handling of the large amounts of data requires a careful con- 
sideration of both the underlying physical properties and the data analysis 
techniques. The foremost problem after a region has been scanned from 
above by using airborne data scanning devices is that of recognition of 
different earth resources in the region. This emphasizes the importance 
of developing certain classification procedures that can meet any urgent 
demand of identification of the scanned ground. 

The greater part of the work presented in this report is addressed to 
the classification problem from a statistical viewpoint as applied to the 
remote sensing technology. It is easy to visualize that a remote sensing 
data will generally inherent a high degree of variability. This suggests 
being careful in associating a stochastic model to the underlying earth 
resources. The empirical approach for determining a model may not be 
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reliable because of uncertainty in identifying the resource of an observa- 
tion obtained via remote sensors. This has led investigators to assume the 
usual law of errors, and thereby, to consider the multivariate Gaussian 
probability model for the earth resource classes. Though the Gaussian model 
has many virtues, it is somewhat an arbitrary choice, and so a careful 
screening of the data is called for prior to associating any such model 
with the underlying classes. This and further consideration of computa- 
tional advantages has led us to suggest alternative models which we have 
called normed exponential densities in one of the given reports. The use 
of these density models which are appropriate in several physical situa- 
tions provides an exact solution for the probabilities of classifications 
(i.e. probability of misclassi fication and probability of correct classifi- 
cation) associated with the Bayes discriminant procedure even when the 
covariance matrices are unequal (a property not enjoyed in the normal density 
case). The computational difficulties that one faces in a complex situation 
such as remote sensing can be reduced to a certain degree if suggested 
normed densities are used as class models. 

In another report the problem of finding the probability that a random 
instantaneous field of views of a multi spectral scanning device consists of 
areas across class boundaries is discussed. Based upon an empirical study 

an estimate of this probability for our example was approximately .4. Such 

/ 

an amount of contamination is significant and it points out another potential 
irregularity that any remotely sensed data may have. 

Some of our reports deal with the problem of estimating probabilities 
of misclassi fication for the Bayes procedure as applied to Gaussian dis- 
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tributed classes, a well-studied problem in the classification theory. Both 
theoretical and empirical work has been done toward the investigation of the 
estimation problem. Further, the relationship between sample-size, feature- 
size, and Euclidean distance measure between classes was evaluated using a 
Monte Carol study. These results indicate that if the ratio of a sample- 
size to a feature-size becomes small, the probability of misclassification 
is increased accordingly. 

Also, we have obtained the minimum variance unbiased estimates (MVUE) 
of the probabilities of misclassification for the case of univariate normal 
probability models for the classes. Other ad hoc procedures previously 
given in the literature such as table look-up technique, have also been 
studied and certain modifications are suggested for making these more 
useful in remote sensing application. 

One report deals with clustering using dynamic programming. The 
problem of dimensionality of the observation vector has been treated from 
the point of both computation and reduction. The reduction of dimension- 
ality has been related to the probability of misclassification under the 
Bayes disciminant procedure. It is pointed out that such basis of probabil- 
ity of misclassification for reduction is not possible by using Karhunen- 
Loeve expansion method; and Wilks' dispersion techniques is recommended 
for the purpose of reducing dimensionality as it involves smaller risk 
in losing information concerning separability of the classes. 

The present work is a continuation of the research being conducted by 
us on the subject of remote sensing data analysis techniques. Besides the 
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classification and feature selection problems, other statistical investiga- 
tions such as the determination of sample size and the accuracy of estimates 
have been discussed in one of the reports. For the problem of estimating 
proportions for different categories of objects a model has been developed 
which takes into account the uncertainty that exists in classifying an 
object measured by remote sensor. More related problems are being con- 
sidered for the sampling scheme in obtaining sample observations from 
any remotely sensed data and deriving estimates of the actual amount or 
size of underlying classes. 

Our reports as listed in the table of contents contain results deserving 
of publication in professional journals. Publications based upon the re- 
sults of the following two reports have been accepted and are expected to 
appear soon in the respective journals: 

(a) Discriminant analysis using certain normed exponential densities, 
(Journal of Pattern Recognition, 1973) 

(b) On the table look-up in discriminate analysis, (Journal of Statistical 
Computation and Simulation, 1973) 

Other reports submitted for publication and being reviewed are: 

(c) An empirical study of classification by thresholding, (IEEE Transactions 

on Computers) / 

/ 

(d) A space application of an extension of the buffon needle problem, 

/ 

(Journal of American Statistical Association) 

(e) Concerning dimension reduction in discriminate analysis, (IEEE, Informa- 
tion Theory) • 
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(f) Effect of intraclass correlation among training samples on the linear 
discrimination procedure, (Journal of Pattern Recognition) 

(g) Estimation of proportion of objects and determination of training 
sample-size in a remote sensing application, (Journal of American 
Statistical Association). 
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An Empirical Study of Classification by Thresholding"* - 

b Y 

2 

Jo Tubbs, B. S. Duran, T. L. Bouillon, and P. L. Odell 


1 . Introduction 

Consider m populations tt^, and suppose each indi- 

vidual in the union of these populations possesses p common ob- 
servable characteristics c^, The observed values of 

T 

an individual are denoted by x = (x^,...,x ) , where denotes 

the observed value of c.. Let p (x) , p 5 (x),...,p (x) denote m 
known multivariate probability density functions .of the p-dimen- 
sional observation vector x and g^, ^ 2 ' ‘ be known a priori 
probabilities that an individual, I, be selected from a popula- 
tion tt^, n respectively. ' 

The classical discriminate analysis problem consists of 

formulating a technique for assigning an individual I selected 

m 


at random from IJ it. into one of the m populations. There 

1=1 1 

have been various techniques, proposed for solving the problem, 
of 1 which the Bayesian solution is optimal, in the sense that it 
minimizes the expected cost of misclassification. 

• In various applications of discriminate analysis, for example 
in the analysis of remote sensing data [6], the amount of compu- 
tation time is immense. Thus it is desirable to develop a tech- 


This research was supported in part by NASA Manned Space- 
craft Center under Contract NAS 9 - 12775. 
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nique which either reduces the number of calculations by reducing 
the dimension of the problem or which judiciously assigns indivi- 
dual I to its "most likely" population, while maintaining approx- 
imately the optimality of the Bayesian procedure. In this paper 
we investigate a discrimination procedure of the latter type, 
which is called classification by thresholding as formulated by 
Min ter and Hallum [3]. 

Before considering the thresholding procedure we first re- 
view the maximum likelihood classification technique. Suppose 

T 

the observed value x = (x^, , • . • , x ) is to be classified into 
one of m populations tt, , Tr~,...,ir . Assuming q.- = q, i = 1,2,..., 
m, the maximum likelihood procedure is to assign x to tt . if 
Pj (x) > p^ (x) for all .1 ^ j. For unequal a priori probabilities 
qi, q 2 , . • • ,q m the procedure becomes that of assigning x to 
if q^Pj (x) > q ± p i (x) for all i / j . 

In the thresholding procedure the maximum likelihood pro- 
cedure has been reformulated in the following manner: 

1. Obtain an observation x to be classified. 


2 . 


3. 


Select a density p^ (x) and evaluate it at x = X where 
the index i is such that q^ > q^ for all j ^ i. 
Compare p^(x) with threshold Tb where 


T. = max t. . 
1 j/i 13 


if Pi < x) > lb, classify X into population ik and go 
to step 1, otherwise go to step 4. 


4. If p^ (x) < lb go to step 2 and select another density 

function (next largest q^). If all the density functions 
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have been evaluated go to step 5. 

5= Assign X using the maximum likelihood decision rule. 

One would expect the computation time to be reduced by, using the 
thresholding technique since the thresholds can be computed prior 
to the discrimination operation and need to be computed only once. 

Consider the following example and figure to clarify the 
thresholding procedure. 

Example 1.1 . Let m = 3, p = 1, > q^ > and suppose T. , T 2 , 

and are known. Since q^ = max {q^, q 2 • ^ 3 ^' ^ is ,,niost likely" 
to be a member of tt^. Hence we first compute q-^p^(x) and compare 
it with T^. If q^p^(x) > T^, X is assigned to otherwise, 

compute q^p^(x) and compare it with T^. If X is not assigned to 
ir^, i.e., q^p^Cx) < , then the maximum likelihood procedure 

s' 

is used. If the classification is done according to the maximum 
likelihood procedure, then the actual classification will take 
longer than the usual Bayesian procedure, due to the extra time 
involved in thresholding. However, if X is classified in 
considerable amount of time has been saved since only one density 
function has had to be evaluated. 
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In this paper we compare the thresholding and maximum like- 
lihood procedures for four choices of three populations. 


2 „ Comparison of Thresholding and Maximum Likelihood Procedures 
Let p^(x), p^ (x) , . . . ,p m (x) be m multivariate normal proba- 
bility density functions and let q^, he m a priori 

probabilities corresponding to m normal populations ir, , . 

x 6 + m 

Then _ 


g i P i (x) = 


q. exp [-1/2 (x - J i ) T ^(x - ^) ] 
(2tt) p/2 | | 1/2 


where y. is the estimate of the mean vector. 7. is the estimate 
of the covariance matrix for population and p is the number 
of characteristics measured. 


If t.. denotes the threshold for populations, tt . and y . 
r j 3 


then 


q i p i (y i } " q j p j ^i* and 


min{q i p i (y i ) , q^Pj(yj)}, if 


t i3 


q i p i (y j > " (Vj> have 

same sign. 


max (q.p. (x) = q .p . (x) ) , otherwise, 

WeE J 3 


where E = {x: q . p • (x) = q.p. (x)}. Minter and Hallum [3] have 

J. J* J J 
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developed a procedure by which the set E can be determined. They 

define t. . to be 
3-D 


/ *^i 3-i ~ ~ 

/ min {In , In } , if q.p. (y . ) - q .p (y . ) and 

iS ± i 1L1 111 331 


t. . 
ID 




q i p i (y j ) ” have 

the same sign 


\min{ln | f ^ | - 2 In q. + (x^ - y i ) T £ i ' L (x^ - y i ) , 

otherwise 


(k) 

where the x are "candidates" determined by their procedure. 

Then the threshold for population is 

T. = min t . . , j = l,2,...,m, 

j^L 13 

where the classification is carried out as before except x is 
classified in ir. if 

l 

ln I I ± I - 2 In q ± + (x - y i ) T ^(x - y ± ) < T ± . 

3. A Monte Carlo Evaluation 

For the Monte Carlo simulation m = 3, p = 3, and n = 100. 

The 100 samples from each population were generated using the 
multivariate normal random generator described in [4]. Four 
trials were considered in each of which thresholding was compared 
with the classical Bayes technique when the parameters are unknown 



- 6 - 


and must be estimated. The same set of covariance matrices 
^2# and ^ we ^e used for all four trials. The four trials then 
differed only in "separation" among the mean vectors. The fol- 
lowing covariance matrices were used in each of the four trials 


/ 400 

-240 

-200 ^ 


-240 

. .400 

360 

9 

'-200 

360 

400 ^ 


• 

/ 400 

240 

-200 ^ 

• 

. 

240 

400 

-360 

9 

'-200 

-360 

400 1 


. / 400 

-240 

200 ^ 


-240 

400 

-360 

1 

• 

' 200 

-36 0 

400 ' 



The mean vectors for each trial were 


Trial 1: = (125, 150, 175) , 


M 2 = ( 150 , 175 , 125 ) T , 


T 


p 3 = ( 175 , 125 , 150 ) , 
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Trial 2: 


Trial 3: 


Trial 4: 



(115, 

150, 

185 ) T , 

(150, 

185, 

115) T , 

(185, 

115, 

T 

150) . 

(100, 

150, 

200 ) T , 

(150, 

200, 

100) T , 

(200, 

10 0 , 

T 

150) x . 

(50, 

150 , 

T 

250) , 

(150, 

200, 

50) T , 

' ^ 
o 
o 

CN 

50, 

150) T . 


The results of the comparison of thresholding and Bayes 
procedure for each of the four trials is given in Table 1. The 
time of 650 (.01 seconds) in the last row of the table was ob- 
tained only for trial 1. However, since the only difference 
among the four trials was the choice of mean vectors, one would 
expect about 650 (.01 seconds) for trials 2, 3, and 4, also. 

The a priori probabilities used were q^ = .5, = .3, and 

= .2. The thresholding classification procedure was carried 
out by classifying 100 observations from (say) , then 100 ob- 
servations from 7T ^ / and finally 100 observations from tt^. This 
is not the usual manner in which the classification should be 
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carried out, however, this method of classifying the 300 obser- 
vations will yield the minimum time for classification for 
thresholding. This allows us to assess the "best" that can be 
done by thresholding. For example, thresholding took 4.37 seconds 
as compared with 6.50 seconds for the Bayes procedure in trial 4 
in Table 1. In reality, however, thresholding would yield a 
value greater than 4.37 seconds whereas the Bayes procedure 
would still take 6.50 seconds. 

In remote sensing applications, such as in per field clas- 
sification [2] , the measurement vectors corresponding to a 
"small" area consisting of adjacent resolution cells, would tend 
to be from the same population. Thus one obtains a set of ob- 
servations from one population, followed by a set from another 
population, etc. The thresholding procedure would perform better 
(timewise) for data observed in this fashion than for data taken 
in the usual manner. The situation described in the previous 
paragraph is^an extreme case of data observed in a remote sensing 
application. 

4 . Feasibility of Thresholding as a Discriminate Technique 

It is evident from our empirical study that there are situ- 
ations in which thresholding is a feasible procedure, and situ- 
ations when one would be better off using the classical Bayes 
.technique. According to Tabid 1 thresholding is a "better" 
procedure for the situation in trials 3 and 4. However, in 
trials 3 and 4 the separation among the populations is sufficiently 
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TADLE 1 



Trial 1 

Trial 2 

Trial 3 

Trial 4 

Classified in ; ^ 

95 

99 

100 

100 

Classified in ^ 

' 

94 

I 

100 

100 

100 

Classified in ^ 

89 

100 

100 

100 

Misclassified 

22 

•0 

0 

i 

0 

Number of times 
classified using 
Bayes 

■ 

.78 

16 

0 

Time using 
Thresholding* 
(in .01 sec. ) 

7 07-- 

..545 

463 

437 

Time using 
Bayes* (in .01 sec.) 

650 

650 

650 

650 


* includes input-output time of 2 seconds 

large to yield a small probability of misclassif ication , in which 
case one could use some other procedure such as the "table look- 
up" technique, [1], [5]. The table look-up technique has been 

shown to require smaller computer time than the classical Bayes 
procedure [5] . 

In trial 1 the "closeness" of the populations forced the 
time for thresholding to be higher than for the Bayes technique. 
In this case 159 out of the 300 observations were classified 
according to the Bayes technique. Additional time was required 

































- 10 - 


for thresholding on the remaining. 141 observations. Also, at 
least one of the density functions p^x), (x) , and p^Cx) had 

to be evaluated for each of these 141 observations. 

The following example illustrates the relationship between 
separation among the populations and the feasibility of the 
thresholding technique. 

Example 4. 1 . Let m = 3, p = 1, and suppose q^, q^, and q^ are 
known (see Figure 2) . By the thresholding technique x is clas- 
sified in it. if x e Rf , i = 1,2,3. Thus the probability that 

x is classified using thresholding is the probability that 
3 

x £ U R'. Now extending the above argument to the general 
i=l 1 

case we have 

. m 

P (x classified using thresholding) = \ P (x £ R') 

r i=l r 1 

where R^ = (x: (x - y i ) T ^T 1 (x - y^.) £ Q i > , 

A iji A 

Qj. = ( Y i - V ± ) l L (y ± - p£) / and y £ is such that q i p i (y ± ) = T ± , 

T n — 1 

for i = 1,2, Since (x - y) 2, (x ~ V) has 
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a chi-square distribution with p degrees of freedom the proba- 
bilities P r (x e R^) , i = 1 , 2 ,... ,m, are easily obtained. These 
probabilities can be used to determine the number of times one 
can expect to classify according to the thresholding technique. 

The results in Table 1 indicate thatl4l, 222 , 284, and 300 
observations were classified by thresholding for trials 1, 2 , 3, 
and 4, respectively. The corresponding expected number of times 
thresholding would be used are 130 , "228, 28?, and 300 for trials 

1 , 2 , 3 , and 4 , respectively. 

5. Concluding Remarks 

From the similation study of this, paper it appears that 
there are situations in which the thresholding technique might 
be optimal in terms of time required for classification. One 
such situation is when the number of populations m and the number 
of observations to be classified are large. For example, in a 
remote sensing application the number of observations to be 
classified may be extremely large. 

The number of populations in our study was m = 3. It seems 
reasonable to expect the thresholding technique to be useful, 
for example, when m = 10 and the average number of density func- 
tions evaluated for each classification is 5 (say) . This de- 
pends, of course, on the degree of separation among the 10 popu- 
lations. Final emphasis should be placed on the fact that if 
there is a high degree of closeness among the m populations then 
one should use the classical Bayes or some other technique. In 
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summary, thresholding could prove useful when there are many 
populations with moderate separation among them. 
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A SPACE APPLICATION OF AN EXTENSION 
OF THE BUFFON NEEDLE PROBLEM 1 

by 

H. L. Gray and B. S. Duran 
Texas Tech University 


1. Introduction 

A problem of some interest in the current space program is 
that of collecting data from the y round through airborne remote 
sensors and using these data to classify the type of ground cover 
or vegetation below. The data collected are often in the form of 
a measure of the light energy radiated in various bands of the 
light spectrum, from a small square on the ground. When the in- 
terest is in classifying the data as coming from a finite number 
of populations, such as in field classification of an agricultural 
area, the problem is a multiple decision problem. Since this is 
a standard problem and an optimal solution is known only in the 
case of conditions which, here, are unrealistic, a number of solu- 
tions have been posed [1] . Regardless of which solution is uti- 
lized, its success is of course a function of the amount of noise 
in the data. This noise, although a function of many variables 
is strongly influenced by the altitude of the satelite or other 
collecting device. 

^ This research was supported in part by NASA Manned Spacecraft 
Center under Contract NAS9-12775. 



One of the more direct influences of the altitude on the data 
is the fact that the diagonal of the data square is a monotonic 
increasing function of the altitude and consequently so is that 
component of the noise which is due to the size of the square. 

In many instances then the question of primary interest is whether 
or not the data square is too large, i.e., whether or not certain 
types of remote sensing are feasible from altitudes as great at 
those of a satelite. Thus if' the data square is a one rnile square 
and the average field size of interest is a 1/2 mile square then 
no data will be obtained which is strictly from the population of 
interest. If those dimensions are reversed, it is still quite 
likely that very little data of interest will be collected. 

Briefly, the question is, for a square of a given size, how 
much data is likely to be obtained from the population of interest 
and how much will be a conglomerate of several populations? 

Although it is not a complete answer to the problem, a measure 

« 

which conveys essentially the information needed to decide the 
feasibility question is the probability, P , that a square dropped 
at random on a "map" will land in the interior of a "region". 

Put in terms more suggestive of the title of this paper, and equally 
informative, the measure of interest is the probability, P, that 
a square randomly placed on a map will cross a boundary line of 
one of the pieces which form the map. The purpose of this paper 
is to determine this latter probability for maps which are "quilts" 
(to be defined later) and to give some empirical evidence to show 
that the results can also be used for maps which are not "quilts". 
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2. Main Result 

Since a map can be formed by any 'collection of geometrical 
shapes it is impossible to find the probability, P, (defined 
above) for all possible maps. However it is usually possible (at 
least for agricultural fields) to approximate a given map by a 
finite collection of rectangles having at least one side in com- 
mon. Moreover, as we shall see, an analytical expression for P 
when our map is such a collection of rectangles which we will 
henceforth refer to as a quilt, can be obtained. Given any map 
our approach will therefore be to approximate it by a quilt, 
calculate the probability of crossing the boundary lines of the 
quilt, and approximate P by this probability. We now make the 
following definitions. 

[CQ] = Event that a square dropped at random on a quilt, 

Q, will cross a boundary line. 

[R. ) = Event that the center of a square dropped at random 
1 n 

will land in rectangle R. , where U R. defines 

1 i-=l 1 

the quilt, Q. 

A . = Area of R. . 
x x 

A = Area of the quilt. 

2w^ = width of R.. 

2h^ = length of R^. 

* * 

With this notation we can now write 

= l PtCQ|R.) P[R.] = ~ l P[CQ|R.]A.. 
i=l 1 1 A i=l x 1 


( 1 ) 


P [CQ] 
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A typical fall of the square center in a rectangle of dimen- 
sions 2w by 2h is displayed in Figure 1. 



Figure 1. 

In Figure 1. x and y denote the minimum distances from the center 
of the square to the vertical and horizontal edges, respectively, 
and e denotes the smallest angle that a diagonal makes with the 
horizontal. We assume that the length, 2d, of the diagonal of 
the square is such that d < w and d < h. This is a convenience 
and could be removed but realistically it does not seem necessary 
to do so. 
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In determining the probability of the square intersecting 
the quilt we first note that the square crosses the quilt if 
and only if at least one of the diagonals crosses it. Thus it 
is sufficient to determine the probability that at least one dia- 
gonal crosses the quilt. 

We will let and denote the diagonals that form the 
smallest and largest positive angles, respectively, with the 
horizontal. Further we assume the quantities X, Y, and 0 defined by 
figure 1 are independent uniform random variables over (0,w), 

(0,h) and ( 0 , tt/ 2 ) , respectively. 

In determining P[CQ|lb] we first consider the events 

[DjCjlb] = event that diagonal crosses the quilt, given 
the center of the square is in R^. 

[DjChVj = event that diagonal crosses a vertical line given 
the center of the square is in R^. 

[D-C-H] = event that diagonal D. crosses a horizontal line, 

J — 1 

given the center of the square is in R^. 

ID. C.V ] = event that diagonal D. crosses a vertical line only, 
given the center of the square is in b. 

^ D j C i Ii o^ = event that diagonal crosses a horizontal line 
only, given the center of the square is in R^ . 

[DjChVH] = event that diagonal crosses a horizontal and 
vertical line given the center of the square is 


in lb . 
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In view of these definitions v/e then have 


P [CQ I R ± ] r= P[D 1 CjR i ] + P[D 2 C{R i l - PKDjCJr^ 0 (D 2 C | R ± ) ] 


= 2P[D 1 cjR i ] - P[(D 1 C|R i ) (1 (D 2 C|R i )] 


2{P[D 1 C i V] + PfD 1 C i H] - P[(D 1 C i V) fl (D^H) ] } 


- P[ (D 1 C|R i ) 0 (D 2 C|R i ) 3 


However, 


P [ D, C . V ] = P(X < d cos 0) = -=- 
1 1 w. 7 r 


ir/2 d cos 0 

I I 


dxdO = 


i 0 0 


2d 

7TW . 
X 


P[D l C i H] = P<Y < d sin 0) = ~~ , 


and 


P [ (D, C . V) n (D. C . H ) ] = —— 7 -—— — 
lx lx' Tih.w. 


x x 


Therefore , 


( 2 ) 


PlD C I R ] = — (i— + - — — ) 

. lu i 1 i J 7T V h. 2h.w. ; 


X X XX. 


To determine P[D^cJr^) 0 (d 2 c[r^)J we observe that can 
cross a vertical line only, a horizontal line only, or both types 
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of lines. Similarly for the diagonal D 2 . Consequently, the 
event iD^cjR^l 0 [D 2 c|r^] can be decomposed into a complete system 
of nine events, i.e. , 


t D i c l R i> n [D 2 c| Ri ; = [(D lCi v 0 ) n (D-^vpi u t (D 1 c i v Q ) n 


u [(d.c.v ) n (d,c.vhi u ! (d c. ii ) n <d,c,v )'] 

XIO 2 1 X JL O 2x0 


U [(D 1 C i H o ) n (D 2 C i H Q ] u I (D 1 C i H o ) 0 (D^VH) ] 


U [(D^^ll) n (D 2 C i V o ) ] u [ (D-^C^VH) n (D 2 C 1 H o ) 3 


U r i( D J _C i VH) 0 D 2 C ± VH). 


Using this decomposition we obtain 


P[(D C.v ) n (D,C-V )) = P[(X < d cos e, Y < d sin e) 

•L X v X X v *“ 


(X < d cos 8, Y < d cos 8)] 


— [ (2 - /2)h. - 

tfhfWi i 2 J 


Similarly, 


p nViV n 'Wo” = ^ << 2 - - §” 
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PKD^VII) n (D 2 C i VFi) j = if - 1), 


PKViV n <D 2 C i H o> ! = P[(D l C i H o> n ‘ViV 1 


• TT 

ith.w. 1 2 
x x 


[t “ 1], 


PKDjQjVH) n (D 2 C i V o )] = P[(D l C i V o ) 0 ^ D 2 C i VH)] 


. A .— [i _ 1] 

TTh.w. LX 4 J ' 


X X 


Pl (D 1 C i VH) fl ( D 2 C i H o )] = P[(D l C i H o } 11 (D 2 C i VH) 


_i? [1 _ 2L] 

Trh i w i 4 


Collecting all these results we obtain 


(3) 


p f co I r 1 — ^ r ^ ^ 

P CQ | R ± J v l Wi h ± 2h i w i 4h i w i 


2d 


) 


„h.w. + V - + 5* ' ' 


X X 


By (1) and (3) the probability that the square crosses a 
quilt consisting of n patches is given by 


n 


P[CQ] = l P [CQ j R • ] (A. /A) 

i=l 1 1 


(4) 


n /2(w.+h.) - ^-(1 + ^-) 


M y { . 

di 


1 X 


Vi 


■} 4w.h. 
x x 
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(4) 


8d 

ttA 


J {/2(w. + 

i=l 


V - 


(1 + 


2 ' J 


since A^ = 4w^h^. If we let denote the perimenter of the 
i patch then the probability (4) may be written as 

(5) P[CQ] = ~ {/2 l P i - 2nd (1 + ir/2 ) } . 

i=l 

3. Examples and Applications 

It is of interest, and in fact quite simple, to verify the 
probability given by (4) empirically. For a particular rectangu 
lar region consisting of n rectangles , all one needs to do is 
choose a rectangle (at random) from the n rectangles and then 
choose x, y, and 6 in (0,w), (0,h), and (0,u/2) by means of a 
random number generator. With this information one can then 
determine whether the square, of fixed diagonal length, crosses 
the rectangle boundaries. If this procedure is repeated m times 
then an estimate of P[CQ] is given by 

A 

P = m^/m, 

where m^ denotes the number of times that the square crosses a 
boundary. 

A 

For n = 12 and d = 1 (see Figure 2) P = 0.7030 was obtained 
using m = 1000. The actual value of P[CQ], by (4), is 

r 

P[CQ) = 0. 7039. 



8 


3 


Figure 2 . 

As we previously stated, the probability that a square dropped 
at random on a map intersects at least one boundary of the map 
can be estimated by the probability in (4) . To demonstrate this 
an aerial photo of an agricultural area was considered. The map 
was first "covered" with an approximating quilt and then equation 
(4) was used to find the probability that the square crossed the 
quilt. This probability was then used to approximate P. The 
accuracy of such an estimate of course depends on how well the quilt 
fits the map. However, it is possible to check the accuracy in 
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this instance by generating points on the approximating quilt as 
described above and then by means of an overlay, note those 
instances where the quilt was crossed but not the map and visa 
versa. 

Figure 3 contains a particular map including an approximating 
quilt. The dimensions of the quilt are 20 cm. by 19.6 cm. and 
the map scale is 1 mi. = 10.4 cm. The diagonal of the square was 
2d = 0.85 cm. For m = 100 we obtained, empirically, 37 drops 
which crossed quilt and map boundaries, 3 which crossed map 
boundaries only, and 2 which crossed quilt boundaries only. We 

A A 

thus have the empirical results P [CQ] = .39 and P[CM] = .40. The 
actual value of P[CQ] by (4) is P[CQ] = .324. The large difference 

/A 

between P[CQ] and P[CQ] is probably due to the small number of 
simulations (m = 100) . However, the important observation is the 

A A 

closeness of P[CQ] and P[CM), and hence, these results indicate 
that P[CQ] may be used to approximate P[CM). 



*. J— 


* .1 1 1 


i ■ i r i • ■ t 


Figure 3. Agricultural map including quil’t: overlay. 
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CONCERN I NG DIMENSION REDUCTION 
IN DISCRIMINATE ANALYSIS 

1 .. Introduct ion 

Let tr | ,^ 2 , . • • *^m be m distinct populations with their individuals 
having p common observable characteristics. The samples obtained as ob- 
servations on these characteristics are then vectors from p-d imens iona 1 
real Euclidean space, the probability density functions p . (x) of the popu- 
lations are p-d imens ional density functions. When p is large compared to 

/ 

m, or large irrespective of magnitude of m, then we face an undesirable 
situation (from computational point of view) where statistical analysis 
involves (perhaps unnecessarily) large dimension of data vector. One way 
to avoid this situation is to compress the data before starting the actual 
statistical analysis by a "suitable linear transformation," to be referred 
to as compression matrix later, of the form 

Y = CX 

where X i s a p * 1 sample vector and C is a real r * p matrix satisfying 
certain conditions. This in statistical literature is referred to as 
"dimension reduction" technique and in the engineering literature as feature 
selection. There are at least two different approaches to the problem of 
selection of C, of which one due to Wilks is based on his concept of scatter 
and the other is based on so called Karhunen-Loeve expansion. Both methods 
are modifications of principal component analysis. In this paper we compare 
the two methods and show how they are related to total misclass i f icat ion 
probability cons i derat ion when the populations are all normal with equal 


covariance matrices. 
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2. Wilks' Technique for Dimension Reduction [5] 

Let {xf a ^ } (i = l ,2 , . . . ,N q ; a=l ,2 m) be m sets of samples from m 

populations tt ^ - » 7r m - Then the v/ i thin scatter matrix S w> between scatter 

matrix S 0 and the total scatter matrix S T are defined to be 
b * 


m 

m N 


a=l 

S ‘ - l l 

“ a=l i = 

(X 

= 1 

m 

N (X (a) - X) 

ft 

(X° 

a=l 

u 


m 

l ° (x 1 - 

i = l ' 

X) 

a=l 



X)', 


( 2 . 1 ) 


( 2 . 2 ) 


(2.3) 


where 


-l a (xf a) - x (a) ) (x< a) - x (a) ) T , x (a) -Y“ x!“>/N , (2.4) 

tt j_.il • . I tt 


N c 
i=l 


and 


m N 

l l 

a-1 i=l 


X = I l xf a) /(N ] +N 2 +...+N m ) . 


m 


(2.5) 


It is easy to show that S=S D +S... Let us denote these matrices by S v (x), 

I D W A 

S (x) and S T (x) respectively, so that the corresponding matrices of the 

D I 

transformed samples Y = C X of dimension r are denoted by S.,(y), $.(y) 

rxl rxp pxl W d 

and S T (t). J.t is evident that 

s w (y) = c S w (x)C T , S B (y) = C S b (x)C T , S T (y) = C S T (x)C T . (2.6) 

* 

The Wilks ' techn i que selects a_ r x p real matrix £ such that the _r columns 

y 

of C a re orthogonal p x | vectors and the ratio 

l s w (y) l/l s T (y) I 

i s minimi zed for a_ fixed val ue £ o£ | S^(y) J . ' 
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Using Lagrange multiplier method it is not difficult to show that such 
C will satisfy the following relations. 

|S B (x) - XS w (x) ) C T j= 0 (2.7) 

and C S (x) C T = K . (2.8) 

w 

Equation (2.7) has a nonzero solution if and only if 

|S B (x) - AS w (x) |= 0 (2.9) 

o 

It is well known (Rao [3] p. 37) that since S D (x) and S.,(x) are two p x p 
symmetric matrices of which S^(x) is positive definite (with probability 
one), then there exists a nonsingular matrix. P such that 

P T S w (x)P = I and P T S B (x)P = L (2.10) 

where 1^ is the p x p unit matrix and L is the diagonal matrix, 


T = diag { X ^ X } . 


Without loss of generality it can be assumed that Xj>^2>...>Xp . 

P^P ^ 0, the values of X for which |Sg(x) - XS^(x) | vanishes are identical 
with those for which |P^Sg(x)P - XP^Sy(x)P[ = | L— X 1 J vanishes. The nonzero 
roots of equation (2.9) are thus the nonzero elements among Xj,...,X . The 
number of such nonzero elements Xj,...,X is exactly equal to the rank of 
S B (x), which in its turn is equal to the dimension of the affine subspace 
spanned by the points X^,...,X^, which is (m-1) with probability one. 

If r = (m-1), that is if X is to be projected to a (m-1) dimensional 
subspace, then the (m-1) rows Cj,...,C^_jj of C will be chosen as orthogonal 


vectors such that 

(S B (X) - XjS w (X)) c! = 0 (j = l (m-1) , 

that is, cT is the eigen vector associated with the eigen value X j . The 


( 2 . 11 ) 
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vectors C[ C T . are then scaled so that |C S..(x)C^| = K. The value of 

I m- 1 w 

|S w (y) J / 1 S T (y) | is then a minimum for the above choice of C and 

min |S w (y) | _ |c S w (x)C T l I |S t ,(x)| 

|S T (y)| |C S t (x)C T | (1+X ] )...(l+A (n _ 1 ) |S T (x)| 

Thus if r = (m-1), the above choice of C not only minimizes | S^, (y) |/ 1 S^. (y) | , 
subject to the condition |Sy(y)[= K, but ?t conserves the ratio [S^|/[S^.| also. 

When r < (m-1), we select c|,...,cj as the eigen vectors corresponding 
to the r largest eigen values Xj,...,X such that 

(S B (x) - X j S w (x)cT = 0 (j = l,. ...r) (2.13) 

and |C S^(x)C^| = K. (2 . 1 A ) 

In this case 

min I s w (y) (/ I S y (y ) I = 1/ [ ( 1+Xj ) ... (1 + ^)] ± |s w (x) |/ |S T (x) | . 

It should be noted that the X's are random variables. When r = (m-1), 
the transformation Y = C X for the above choice of C is actually a projection 
onto the subspacc spanned by the vectors - X^,...,X^ - X^. 

3. Karhunen-Loe ve , Expansion Technique [2,3] 

(a) 

Let {X. w }(i = l,...,N a ; a=l,... ,m) be m sets of samples (p * 1 vectors) 
from m populations. The covariance matrix S of the grand sample is then 

A 

S = S t /(N,+. . .+N ) ' 

1 l m 

where S^. is the total scatter matrix. The generalized Karhunen-Loeve 
expansion theorem (due to Chien and Fu[2]) states that every sample vector 
x| a ^ can be represented as 
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where 


X. ta) ‘f a. (a) 

j=l *J J 


»W= , T x.<“> 

U J i 


and <f> j , . • • i <j)p are eigen vectors of S. Let the corresponding eigen values 
be , which without loss of generality can be assumed to be ordered, 

viz, Aj A ^ ... A . The compression matrix . C for compressing the observation 


rxp 


to r dimensions is then given by 


C 

pxr 


^ - * * 4* r ] • 

px] pxl 


The motivation of such choice of C lies in the fact that the mean squared 

error between xf 0 ^ and C xf°^ 

ii 


| |X. (a) - C xf a) | |= A + . . . +A 

' 1 i i 1 1 r+1 P 

is minimum. But this choice of C has no association (in general) with 

misclassif ication probability or separability of classes. Nothing can be 

/ 

said about the information regarding separability of classes, since it is 
not apparent what this transformation does to the between scatter matrix S 


B' 


Probability of Mi sc 1 ass i f ica t ion and Dimension Reduction 

Let the probability densities p.(x) be known to be normal Np(y.,Z) 

(i = l,...,m). Let x = (x, ,...,x )"*" denote a set of observations on the 

1 P 

p(common) characteristics of an individual i (x) , q the a-priori probability 

i 

of selecting an individual from w , C(jJi) the cost of miscl ass i fy ing an 
individual from ti . as being from n . . Let R = (Rj , . . . , R^) be a parti tion of 
the p-dimension at Euclidean space into m regi-ons defining a classification 
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procedure such that I (x) e it. whenever x e R.. Then the expected misclassi- 
fication cost for this procedure is given by 


m m 

Q (x,R) = l l [q ; Pi M C (j [ i ) 3 dx . 

P i = | j = l K j ' ' 

j/i 


(4.1) 


If C ( j J i ) = ! for all j / i and q. = 1/m for all i, then 

m m 

Q n (x,R) = 1 l I/" 1 P: (x) dx 
P i = l j=l j 1 

j/i 


m m 

= l l/m(l -/ p.(x)dx)=l-I f R p . (x) dx . 
i = l R i 1 i=l i 1 


(4.2) 


It is well known (Anderson [1] p 148) that the regions of classification 

"tZ 

R = (Rj,...,R”) that minimize the expected cost Qp(x,R) (or maximize the 

sum of probabilities of proper classification £./ p.(x)dx, in this case) 

1 R i 1 

are defined by ‘ 

JL 

R" = (x:U., (x)>0 for all k/j ) (j = l , . . . ,m) , 

J J K 


where 


-1 


Uj k (x) = [x - l/2(Pj+p k )] T l (pj-P k ) 


t4.3) 


and 


m 


Q (x,R~) = 1 - l / R . v p. (x) dx 
H j=I J J 


(4.4) 


As in dimension reduction technique, let us compress the p x ] observation 

vectors by a linear transformation Y = C X before starting classification. 

rxl rxp px] 

The probability density functions p. (y) of the transformed populations 
n j ( i=l , . . . ,m) are then given by 


_P; (y) = N r (cp. ,czc ) . 
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* A A A 

The classification regions in this r dimensional space, R = (R j , . . . , R^) , 
giving minimum expected mi scl ass i f i ca t ion cost (or maximum sum of probabil- 
ities of proper class i f icat ion, in this case) are now given by 

A A 

Rl = (Ujk(y) > 0 for all k / j) , (^-5) 

where U.k(y) = [y - 1/2 C (p .+p. ) ] T (CZC T ) 1 C(p.-p.) 

J J < • J K 

= [x - 1/2 (pj+p k ) ] T C T (CEC T ) l C(p^-p k ), (^.6) 

a j. fn . . 

and Q_(y,R") = 1 - I JZ P- (y)dy . 

j“l j J 

Our problem is to find C such that Q '(x,R''), that is, (in this case) 
probabilities of proper classification or mi scl ass i f i cat i on are not changed. 
This is possible if 


/ R * Pj(x)dx = /~ A p" (y) dy . (b.8) 

j J j J 


Let F denote the projection of E 


onto a r-d imens ional subspace S given by 


y = Cx. By definition Rj = r ( R j- ) and p^ (y) is the density of the probability 
measure P^ obtained as the projection onto this r-d imensional space of the 
probability measure with density p^ (x) . Then it follows from (^.5) that 
the probabilities of proper classification or mi sclass i f i cat ion are preserved 
if 


(1) P can be expressed as the product. of two measures, viz P =P x p 

p r r ’ p r p-r 

(2) Rv, for each j=l,...,m can be expressed as the cartesian product 

A 

of two sets, viz R* = R* E -S = P(R*) x E -S (E -S denoting 

J J P J P P a 

subspace orthogonal to S). 
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The first condition can always be met. But, as we will show, the second 

condition can be met only if T is a projection onto the subspace M 

spanned by the difference of the mean vectors, viz. p -p p -p , . V/ithout 

L I m I 

loss of generality we can assume Pj=0. Then M becomes the subspace spanned 

by p.,p_,...,p . Thus, we need to show that condition (2) is met if S = M. 
z 3 m 

Let the dimension of M, the space spanned by p^,...,p m (assuming Pj = 0) , 

t* 

be k; C be a k*p matrix such that y = CxeM for xeE^ and C a p-kxp matrix 

such that z = Cx E -M. Then CC^ = $. If X ^ N (p-E) then Y = CX N. 

p Y P k 

T * • * A T 

(Cp.CeC ), Z = CX 'v Np k (<j>, CEC ) and Y and Z are independent. Let 

(4.9) 
(4.10) 



Then, since Cpj = 0 for all j, we have 

U j k ( x ) “ lx - l/2(pj+p k )] T l ] (Pj-P k ) 

- V - 1/2 c(p.+p k ) T z T ](D _1 ) T r ] D _1 [C(p.-p k )^ 

= [y T - 1/2 C(p j +p k ) T Z T J(DED T ) _1 [C(p j -P k ) $] 

“ ly T - 1/2 c(p j +p k ) T ] (cec t ) -1 c(p j -p k ) ( 4 .ii') 

= {x - l/2(p j +p R )] T C T (CEC T )“ 1 C(p j -p k ) 


(4.12) 
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Thus for each j (j = m) we can write Rj as 


n 


R j " k/j {x: u jk (x) 0) 


k £j {y: [y T - 1/2 c( vV T )<crc T )-'c( v , k ) > 0) X(E p -m) 


R* X (E -M) 

J p 


(A. 13) 


We thus have the following theorem. 


Theorem 1 . Let C be a k x p matrix projecting orthogonally to the sub- 
space M spanned by the vectors p 2 - p j , • • • ,P m “p j , k being the dimension of M. 
Then the maximum sum of probabilities of proper classification (hence, for 
equal a priori probabilities and misclass i f ication cost, the expected mis- 
classification cost) is not altered even when the classification is done on 
the basis of the compressed observation y = Cx. 

Now let us further restrict C so that C.EC.^ = 0, for i ^ j and C.,C. 

i j J i j 

representing the i ^ and row of C respectively. With such C, CEC^ will 
be a diagonal matrix. Now let us define the between and within scatter 
matrices for these m populations respectively by 


m 


S B = l. , ( V* j) (lJ i -lj)T ’ 

i = l 


(4. wo 


m 


where 


P = l 

i = l ' 


and 


; w = m l • 


(A. 15) 


J 
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Let us further restrict C such that C.SgC^T = 0 for i ^ j . Then both 
C and C S W C T are diagonal matrices. Nov/ if we choose ^j > ^2 ••• > ^| < > ^ 

such that 

C.S 0 C. T = A . C . S C , T , 

J B j J J W j 


Then 


C($ B - AS W )C‘ = 0 . 


(^. 16 ) 


Conversely, if C is chosen as solutions of 

(S B - XS W )C T =0 

C will project orthogonally to M. ( 4 . 17) 

Thus we have the following theorem. 

Theorem 2 . Let C be a k * p matrix satisfying the equation 

(S B - *S W )C T = 0, 

where S n and S. . are defined as in (*1.1*0 and (*1.15). Then the maximum 
d W 

sum of probabilities of proper classification (or expected misc lass i f icat ion 
cost, in this case) is not altered if classification is made on the basis 
of compressed observation y = Cx* 

Now if C is not a projection onto the subspace spanned by y^-yj, 
rxp 


u -y, , then C(y.-y.) ^ <J> and hence 
m 1 j k v 


U jk (x) = ty T - l/2C( Mj +M k ) T ] (CIC 1 )' 1 C(|i--H k ) 




+ [Z T - l/2C(yj+p k ) T ] (CZC T ) 1 C(vu-y k ) 


Thus we see that in this case, Rv cannot be expressed in the form Rv x E - M, 

J J P 

A 

where Rv is a subset of the r dimensional subspace, the range of C. 
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5. Comparison of W i 1 k 1 s and Karhunen-Loeve Technique: An Example . 

Let us assume that we have same number (N) of samples from each of 

the m populations. Then we may note that Sg/N in 2.3 is an estimate of 
m - - T 

I c I (p.-p)(p.-p) in (^.14) and S R /N in (2.1) is an estimate of m£ . 
i = l B 

Thus equation (k . 17) is obtained from (2.7) if we replace S D and 5.. by 

D W 

m _ _ T 

][ (p.-p)(p.-p) and mE respectively. This leads us to expect intuitively 

i = l ' ' 

that mi sc 1 ass i f icat ion probabilities are preserved under Wilk's technique 
when C is m-1 x p. We now give an example for comparison of performance 
of Karhunen-Loeve expansion technique and Wilks' technique. 

Exampl e . Let p.(x) = N^(p.,r) (i=1,2).. As there are 2 populations, 

the means are coll inear. Let Pj = [0,0, 0]^ p 2 = [0,0,2] and 



6 0 0 
0 5 0 
0 0 1 


— \ T 


Then £ (p.-p) (p.-p) 
i = l 1 1 


0 0 0 
0 0 0 
0 0 2 


and hence S of Karhunen-Loeve expansion is given by 

S = l (p .-p) (p.-p) T + 2^ 
i = l ‘ ' 

12 0 O' 

0 10 0 
0 0 5. 
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The maximum eigen value is 12 and corresponding eigen vector 4* is given 
by 4>j ~ (1,0,0). Thus C = (1,0,0). Y = C X = X j . But there is no 
discriminant information in Y, since Y 'v N (0,12) given 1 (x) e tt j and 
V i N (0,12) given I (x) eir ^ . 

In case of Wilks’ technique, 

S B " * m I = 


0 

0 

0 


6 

0 

0 

0 

0 

0 

-2X 

0 

5 

0 

s> 

0 

2_ 


0 

0 

l 



0 

-10X 

0 


0 

0 

2-2X_ 


The nonzero root of | - XmZ j = 0 is X=1 . The eigen vector corresponding 

T 

to the eigen value X=1 is <p j = [0,0,1] and thus C = (0,0,1). Y = CX and 
Y % N(0,1) if I (x) , and Y s- N (2,1) if iMe^. This shows that there do 

exist discriminant information in Y. 


6. Concluding Remarks 

Among the engineers [2,3], a popular technique of "dimension reduction" 

v 

or "feature selection" is one based on so called Karhunen-Loeve expansion. 

We do not recommend this method for selecting the compression matrix C for 
the following reasons. 

0) There may be loss of information concerning the separability of 
classes. We have encountered an extreme case in the example of section 5T 
where after compression, two populations have become identical. 
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(2) There is no apparent relation between this method and misclassi- 
fication probability. It is not apparent what the compression matrix C 
selected by this method does to the mi scl ass i f i ca t i on probability. 

We rather recommend V/ i Iks' technique for selecting the compression 
matrix. We have noted in sections k and 5 how the selection of C by 
Wilks' technique is related to mi scl ass i f i cat ion probability and why 
it may be expected to preserve the mi sc 1 ass i f ica t ion probability. Besides, 
as the compression matrix C selected by Wilk's technique maximizes the 
between scatter matrix, it may be expected that there will be no loss in 
information concerning the separability of classes. 
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Discriminant Analysis Using Certain 
Normed Exponential Densities.^ - 

2 

R. S. Chhikara 
and 

3 

P. L. Odell 
1. ’ Introduction 

The statistical discriminant analysis technique is a rela- 
tively old one [1 ] , [ 2 ] ; yet in recent years there has been 
renewed interest generated primarily by the desire to develop 
automatic statistical recognition systems; both analog and 
digital [3], [4]. Electronic and optic scientists have 
developed multichannel spectral measuring devices [5] which 
when attached to aircrafts or space crafts take enormous amounts 
of data whose speedy reduction depends on rapid repeated perfor- 
mance of a discriminant algorithm using high speed computing 
machines. 

The purpose of this paper is to discuss the Bayes discrimi- 
nant analysis using certain normed exponential probability 
densities as models and to provide ways to reduce computations 
that are performed for discriminant analysis in the remote 
sensing application [6] , [7] . For clarity and completeness we 
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will rewiew briefly the classical statistical discriminant 

problem. Let I denote an individual belonging to one of m 

distinct populations. Assume that each member of the union 

of the m populations possesses a finite set of observable 

common characteristics or features which we denote by 

T 

C = (C^ ,C 2 , . . . , C ) whose observed values are denoted by 

T 

X = (x, ,x 5 , . . . ,x ) such that x. is the observed value of 
L ~ P D 

the characteristic , j = l,2,...,p. If one assumes that 

T 

the characteristics C = (C^ ,C 2 , . . . ,C ) are selected a priori, 
the discriminant problem can be summarized as follows: 


The Bayesian Discriminant Problem. Let II , ,H_ , . . . , II denote 
12 m 

m distinct populations whose known multivariate probability 
density functions of the p-dimensional measurement random 
vector x are denoted by p^ (x) , p 2 (x) , . „ . , p m (x), respectively. 
Let q 1 ,q 2 ,... / q m be the known a priori probabilities that an 
individual, I, be selected from a population * * ’ '^m' 

respectively. Let C(ijj) be the cost of misclassifying an 
individual from population Jij as being from population 
such that • 


o 

p 

V 

o 

i ¥■ 3 

*“ 1/2 / •••/ m 


= 0 

i = j 

1/3 1 , 2 , • • . , m # 

d.i) 


Given the p x 1 measurement vector x made on the characteristics 
of an individual, I, selected at random from the union of the 
populations Jl ^ ^ • ■* - , II r the problem is to formulate a decision 
rule R which miminizes the expected cost of misclassif ication 
for assigning I to one of the populations IK, i = l,...,m. 


/ 
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Let R = (r 1# R 2# . . . , R m ) denote an exhaustive partitioning 
of the Euclidean p-dimensional space into m mutually exclusive 


subsets such that if the observation vector x belongs to R 
then we assign the individual, I, which generated the observa- 


tion vector x to the population n ^ . Note that 


m m 

L(R) = l q. I C(j|i)P(j|i) (1.2) 

i=l , 1 j=l 

j^i 


is the expected cost of misclassif ication associated with an 


individual, I, where 


. P ( j ] i) = / p. (x)dx (1.3) 

R. 

3 

is the probability that x belongs to R^ given that the individual 
I, is from IK. Clearly, there exists many partitions R, such that 
L (R) in (1.2) is minimized. The following theorem proved in [1] 


summarizes the Bayesian approach for computing the optimal pro- 


cedure (partition) R. 


Theorem 1. The procedure R, that minimizes the expected cost 


of misclassif ication (1.2) is defined by assigning x to R. if 


m 

m 


.1 q iPi ( X )C(k|i) 

< l q.p.(x)C (j j i) 

“ i=l 1 

(1.4) 

i^k 

i^j 


j 

" 3L / 2 ^ • f rn « 



.Corollary 1.1. If C(ijj) = C for all i and j such that i ^ j, 
then (1.4) reduces to 

m m 

I q.p.(x) < l q.p. (X) , j = 1 , . . . ,m 

1=1 1 i=l 1 X 

i^j 


(1.5) 
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which is an ordering of the probabilities of misclassif ication , 
which is in 'turn equivalent to 

q k P k (x) = max (q i p i (x)} . ' (1.6) 

1 1 , . BO , m 

If further q^ = q^ for all i and j = 1 , 2 ,... ,m then (1.6) 
reduces to 

P k (x) = max {p. (x) ) , (1.7) 

i-1, ... /in 

the maximum likelihood solution of the discriminant problem. 

It is important to note that one must know a great amount 
in order to apply Theorem 1. Unfortunately, there are many cases 
in practice in which the a priori probabilities q^ , i = 1,2, ...,m 
are unknown. If C(j]i) are unknown or not assumed equal, then 
the problem is not very tractable, hence in most applications where 
C(ijj) are not known they are tacitly assumed equal for all 
i j. If q^,...,q m are unknown, one may assume that q^ = q^ 
for all i, j which implies that (1.4) is void of the q^'s. 

Another approach which requires a previously performed sampling 
task would be to estimate q^ , i = 1,2,... r m and then approximate the 

a 

Bayes solution to the problem with q^ = q^, i = l,2,...,m, in (1.4) 

A 

where q^ is an estimate of q_^. In the remote sensing application 
which interests us here, the assumptions C(ijj) = C and q^ = 1/m 
for all i / j and all i = l,2,...,m, respectively, are not in 
many cases unrealistic. The remote sensing problem can be summar- 
ized as follows: 

The Remote Sensing Problem . Let an image (or scene) be a rect- 
angular region with r rows (scan lines) and c columns (number of, 
resolution elements per scan line) consisting of rc resolution 


/ 



elements (individuals). Each cell (individual, I) generates 
a p * 1 measurement vector = {x^^} where h — 1,-2,..., p, 

i = 1,2, ... ,r, and j = 1,2, ... ,c. In order to recognize a 
single scene, one must perform rc discriminating tasks "as 
effectively as possible" (if the scene is classified point 
by point). For details, refer to [8], [9] and [10], 

The problem is conceptually a repeated application of 
multivariate discriminant analysis outlined earlier in which, 
if necessary, estimates of parameters are substituted for 
parameter values. That is, if p^(x), i = 1,2,..., m, denotes 
the unknown probability density function of a p x 1 measure- 
ment vector X associated with the ith population IK , i = 1,2, 
...,m, due to the unknown mean vector y^ and covariance matrix 
/ an estimate of p^(x) is 

A ^ A 

,Pi<x) = P ± ,y ± > X ± ) 

where /. and y. are estimates of y. and T.. 

X x x L x 

Suppose X = x denotes an observation from an individual 
we wish to classify. In one of the simplest cases where normal- 
ity holds and q^ = 1/m and C(ijj) = C for all i / j, the optimal 
solution is given by (1.6); or equivalently, the individual 
I = I (x) who generated the observation x is assigned to if 

A 

In p,(x) = max (In p. (x) } (1.8) 

i=l, 2 , . . . ,m 

where 
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ln 


- 1/2 (x - jm) T I i 1 (x - ^ i ) 


and 


K 


± = -p/2 In 2H - 1/2 In 


and in order to evaluate (1.6) and (1.8) one must always be 
able to evaluate the quadratic form 


Q ± = (x M i ) 1 J\ 1 (x - y ± ) 
for i - 1 , 2 , . . . ,m. 

2. On Eliminating the Normality Assumption 

One notes that a computational problem in the form of 
evaluating a quadratic is associated with the normality assump- 
tion. Experience has shown' that reasonable to excellent results 
in the form of minimal expected costs of misclassif ication are 
obtained when the normality assumption is made; hence gives 
empirical experience to support its value even though there 
exists cases in which one can reject statistically that the 
data is normal. 

. Since the normality assumption is an arbitrary choice of a 
model, a natural suggestion is to replace the assumption with 
a selection of an alternative multivariate probability density 
model which can realistically describe the density of the 
measurements and whose likelihood values are faster to compute. 
However, it is reasonable to conjecture that if one incorpor- 
ates correlation in any multivariate model it will necessarily 
imply quadratic terms to be evaluated. Nevertheless the follow- 
ing example gives us some experience in our attempt to formulate 

i ^ 

i 
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a multivariate density not unlike the multivariate normal 
but eliminates the quadratic form. 

Example 2.1 . For a random vector x , let E[x3 = y and E[(x-y) 

• ■ m ' r» — 1 

(x-y) 3 = l „• Since l is positive definite, then the quad- 
ratic form can be expressed as 

Q 2 = (x-y ) T B T B (x-y) 

where B is such that 


>-l T 
' = B B 


( 2 . 1 ) 


and is a unique lower triangular matrix. Now define 


Then 


Y = B (x-y ) = (Y Y ,...,Y ) 
q 2 = Y T Y = lY* 


,T. 


(2.3) 

(2.4) 


and E [Y } = <p , a null vector, and E [YY ) = I, a nxn unit 
matrix. 

Let us denote ||Y|j^ = ] y^ 1 r , 0<r<°°. Then Q 2 = 

2 i=1 

J J Y] ] 2 , the squared Euclidean distance of x from y weighted 

by the matrix Y ^ . For r=«> / denote ||Y|j = max (]Y |). 

r “ k=l , 2 , . . . ,p 

Note that ] ] Y | | and ] ] Y ] | are different measures of the weighted 


distance of x from y. However, it is much faster to compute ] j Y | 

2 

and ]]Y|] ro than ||Y|| 2 as these eliminate the quadratic form. We 
will elaborate on this aspect in section 5. 

Next, as we discussed in [173 , the evaluation of probabilities 
of classification P ( j J i ) , i and j = l,2,...,ra, defined in (1.3) is 
a difficult task when p^ (x) are assumed to be normal, and in part-, 
icular, a theoretical solution is completely out of order if covar- 
iance matrices i = l,2,...,m are not assumed equal. This is 



-48- 


because P(j|i) will involve multivariate normal probability 
integrals over arbitrary domains described by quadratic func- 
tions. This has led investigators in the past either to res- 
trict their discussion to a highly simplified form of mean 
vectors (i=l,2, . . . ,m) and = ][ for all i =. 1,2, . . . ,m or 
to seek refuge in a computer algorithm based upon approx-> 
imations which may be far from being accurate and even expensive 
due to a large number of repeated computations. 

With these considerations in mind, we propose the use of 
certain normed exponential densities given in the next section 
for the. Bayes discriminant analysis. These densities lead 
to minimum number of computations, piecewise linear discrimi- 
nant functions when there exists inequality among the covari- 
ance matrices (a property not attained under normality when 
unequal covariance matrices are assumed) and a theoretical 
solution regarding evaluation of probabilities of classifica- 
tion . 


3. Normed Exponential Density Functions 

In order to enlarge the class of density functions, we 

first define the r-hormed exponential density. 

( r ) ^ 

Definition 1 : For 0<r<~, f (y) is the r-normai expon- 


ential density function of a random vector Y = (Y^,...,Y ) 
if 


T 


where 


f (r) (y) = Ke“ c l ly I 


K = [^J^e CU du] ^ 


, c>0 


and c is determined so that E[YY ] = I 


(3.1) 
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Definition 2 ; For r = 05 the maximum normed exponential 

( 03 ) 

density function f (y) of a random vector Y is given by 

“ c !!yIL 


f {co) (y) = Ke 


c>0 


(3.2) 


where 


c = 


2 P (p+2) ! 

1/ (p+2) -i 

ar\r\ y: — ZL 

(p+2) ! 

3 

L 

diiki i\ — 1 

p • 

12 


(Again c was determined so that E[YY ] = I) 

( IT ) (°° ) 

Note that f (y) and f (y) are symmetrical about y = 0 
and cover a, wide range of multivariate density functions. 

For p = 1, c = /2 and K = .2 P/ ^ 2 , the density function is 
given by 

f^(y) = (1/2 P//2 ) exp (— 1^2? 1 y. | ) , -»<y <» (3.3) 

1 K k = 2,2, ... ,p 

which is the multivariate analog of the double exponential 
density and can be interpreted as the likelihood function when 
a set of p observations are sampled from the univariate popula- 
tion with p.d.f. 

p(y) = 1//2 , 


— co< V< 00 


For p = 2, c - 1/2 and K = (2tt) 
tion is 


-p/2 


the 2-normed density func- 


f (2) (y) = (1/2tt) p/2 exp (-1/2 l y 2 ) 


(3.4) 


which is the multivariate normal density with mean vector <{> and 

(2) 

covariance matrix I. Observe that f (y) is less peaked at y = 0 
as compared to f ^ (y) but more peaked when it is compared to 
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(r) 

Though f (y) in (3.1) leads to several other density 

'1) 

functions, here we will primarily be concerned with f * (y) 

{ co Y 

and f (y) which are suitable as models in various physical 
situations. In the next section, we discuss the problem of 
discrimination using these density functions and provide 
examples for better comprehension. 

4. Bayes Discriminant Procedures 

Consider the populations with the probability density 

(1) (°°) (1) 
functions p^ (x) or p^ (x) which can be obtained from f (y) 

( °° ) 

in (3.3) or f (y) in (3.2) for the random vector X = (x,,..., 
,T 

V ' 


X 


b r^Y 
1 


w i 


i = 1 , 2 , . . . ,m. (4.1) 


Accordingly, we have 
P 


(1) (x) . K 1 e' ,/ \hE ik (x- Ui )| 
i " ~ 7£72 1 


x — 1,2,..., m 

where B^ is the kth row of the matrix B^, that is 


B. = 
X" 


B il 

B iP 


(4.2) 


where each of B^, k = l,2,...,p, is a Ixp vector. Observe 
that the determinant ]B^J = 

Similarly, 


(co) , % 1 

Pi <x) = p l 


(p+2) I 


12 


p/(p+2) 


-c 


max 


where 


<| B ik (x-Pi)|) 

(4.3) 

x — 1,2,..., m 


B^ e k=l,2, . . . ,n 


2 


[ 2 P (p+2) I 


l/(p+2) 
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I 


4.1. Populations with density functions (x) , i = 1,2,... ,m. 

For the sake of simplicity, let us assume equal costs of 
misclassif ication . For given a priori probabilities ^ ' ^2 ' 

. . . ,q , ' it follows from [1. p. 142-143] that the Bayes discrimi- 
nant regions , R 2 ' * * * with respect to density functions 
p^"^ (x) , i = 1 , 2 , . . . ,m are given by 

R. = (x: I | B (x-y ) J - J Jb . (x-y . ) |>* log 2i i=l,2, . . . ,m, 

3 k=l lk 1 k=l 3 * 3 > i'/j , 

jl— = 1,2,. « . , m . } 

(4.4) 

Observe that the discriminant boundaries for regions 
. . . ,R are piecewise linear and can be distinctly found for 
given y's and £'s. . Since the integration of a density function 
p5^ (x) over any domain of linear (rectangular) planes can 
easily be evaluated, one should be able to obtain the probability 
of correctly classifying an observation x from population tt^, 

p(i[i) = / R pf 1J (x)dx , i = 1,2, ...,m. 

i 

And the probabilities of misclassifying an observation x from 

if . to another population -n . 

1 3 

p(j]i) = / pf 1 ^ (x)dx , j = 1,2,. ..,m, j ^ i 

j 


However, a Bayes region could be characterized by as many as 
2^different piecewise linear functions. This is because of so 
many different possibilities that exist for the values in deter- 
mining any inequality in (4.4). If q 1 =q 2 =. • the Bayes regions 

are given by 


R. 

J 


P p 

{X: J_J B ik (x_l, i ) l i zId (x-l-)|, i=l,2 m, 

k - 1 - 3k 3 (4.5) 

/ ^ / • • • / HI* 
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We now give an example to illustrate the algorithm 

involved in obtaining the "Bayes regions and P(j]i). 

Example 4.1 : • Suppose we have two populations and 

v t " T 

with mean vectors = (1^-2) , p 2 ~ and covariance 

matrices 




0 

16 


respectively. Then/ due to (2.1) 


B, = 

1/2 

0 ’ 

and B~ = 

1/3 

0 

1 

0 

1/3 

z 

0 

i/4_ 


For the sake of simplicity, assume = q^- Since 



-x'i-r 



’x x +r 

B 1 (x-u 1 ) = 

2 

x 2 +2 

and 

B 2 (x-y 2 > = 

3 

x 2 -l 


3 



4 


the two density functions are 

p<l> (x) = 1_ € -/2[1/2 |x 1 -1|+1/3|x 2 +2|] 

1 12 

p (l) (x) = 1 e -/2[l/3|x 1+ l| + l/4|x 2 -l|J 
2 24 

and the Bayes discriminant regions are determined by 


R ± = {x: l/2|x 1 -l|+l/3|x 2 +2i^l/3|x 1 +ll+l/4|x 2 -l| } 


and 

R 2 = {x: .1/2|x 1 -1|.+1/3|x 2 +2|>1/3|x 1 +1|+ l/4jx 2 -l| }. 
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Due to absolute sign, the two-dimensional Euclidean space 
is partitioned into 9 different regions and the Bayes dis 
criminant regions are given by 


where 




R ] 

R ] 

Hi 

R ] 

Ri 

R. 

R J 

R ] 

Ri 


11 = 

(x: 

2x^+X2+lf.O , 

x >1, X 2 > 1 -^ 

12 = 

(x: 

10x-^-X2“13^_0 

, -1<X 1 <1, X 2 > 1} 

13 = 

{x: 

2x- t -X2“21^_0 , 

x^<-l, 

14 

{x: 

2x^-7x 2 -15>_0 

, x,<-l, -2 <x 2 <1) 

15 = 

{x: 

10x 1 -7x 2 -7i.O 

, -1<x,<1, -2<x 2 <1) 

16 = 

(x: 

2x^-lx^-5<0 , 

x^>1, -2<x 2 < 1) 

v-> 

ii 

(x: 

2x^-X2~21<_0 , 

x^>1, X2 < -2) 

18 

(x: 

10x 1 +x 2 +9i.O , 

-1<x 1 < l , X2<-2> 

19 = 

{x: 

2x^+X2+l^_0 , 

x 1 <-1, X2<-2}, 

R h 

the 

complement of 

. Since R.^, R 12 f 


R 


and R^g are empty sets , 


R 1 “ R 15 U R 16 U R 17 U R 18 
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One can now evaluate the probabilities of correct 
classification and the probabilities of misclassification. 
For example, the probability of correct classifying an 
observation from ir^ is given by 

( 1 ) 


P(lll) = / p T j (x)dx 


R, 

x 1 10/7x, -1 1//2 ( Xl -l)-/2/3 (x 0 +2) 
= 1/12 / / 1 e 1 2 

-1 -2 


dx^dx^ 


» (5-2x )/7 -l//2(x -l)-/2/3(x +2) 

+ 1/12/ f 1 e 1 dx dx 

1-2 

“ -2 -l//2'(x,-l)+/2/3 (x„+2) 

+ 1/12/ / e 1 2 

1 2x -21 
1 -2 l//2(x..-l) + /2/3 (x~+2) 

+1/12 / / e • ■ dx dx 

-1 -lOx -9 

-/2 1 -17/2/21 -6/2/7 

= 1/4 (1-e -e +e )+ remaining three terms 


dx 2 dx^ 


Similarly, the probability of misclassifying an observation from 
r 2 is given by 

P (1 J 2) = / p ( ^ ( x ) dx 
R-, 


1 10/7x, -1 -/2/3 (x,+l) +1/2/2 (x 9 -l) 

= 1/12 J J 1 e 1 * 

“1 -2 

(5-2x y7 -/2/3(x +l)+l/2/2(x -1) 


dx 2 dx^ 


+ 1 / 12 / / 

1 -2 

» -2 -/2/3 (x+1) +1/2/2 (x 2 -l) 


dx 2 dx-^ 


+ 1/12/ / e 
^ 1 2x, -21 

1 -2 

+ 1/12 / / e 

-1 -lOx-9 


dx 2 dx^ 


-/2/3 (xjl+D+1/2/2 (x 2 -1) 


dx 2 dx-^ 


which can be easily evaluated. In a similar way one can find 
P(2J2), the probability of correctly classifying an observation 
from tt 2 and P{2jl), the probability of misclassifying an obser- 


1 * 


vation from 
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In example 4.1, we had zero covariance in the covariance 
matrices. In the following example we consider random vectors 
whose components are correlated, that is covariances are not 
zero o 

Example 4.2 Suppose there are populations and tt 2 with 

T T 

mean vectors y^ = (2,1) , y^ = (-1,-2) and covariance matrices 


Z 1 = 


13/4 -9/2 

-9/2 9 

Then due to (2.1) 


B 1 = 


E 2 = 


100/9 32/3 

32/3 16 


1 

1/2 

and 

B = 

1/2 

-1/3 

0 

1/3 


2 ' 

0 

1/4 


and so 


B 1 (x-y 1 ) = 


2x., +x^-5 
1 2 ’ ^ 
x 0 — 1 


b 2 ( x “M^) = 


3x -2x^-1 

x ? +2 
■ “^4 


Accordingly, the normed density functions associated with 
and tt ^ are given by 


( 1 ) 


(x) = 1/6 e 


-l/3/2(3j2x 1 +x 2 -5j+ 2|x 2 -l ) 


and 

(1) -1/6/2 (2| 3x 1 -2x 2 -lj+3]x 2 +2j ) 

p 2 ' (x) = 1/16 e 

3 _ _8 

Let - n» q 2 • Then from (4.4), the Bayes discriminant 

regions are obtained as 


R^ = (x: 2 j 3x 1 ~2x 2 -l J + 3 | x 2 +2 | >_6 | 2x 1 +x 2 ~5 | + 4 |x 2 ~l | ) 


and R^ is the complement of R^. 
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By simplying the inequality in , we obtain 

R 1 = K=1 R 1K 

where 


R 11 

= {xj 

6x2+llX2“38£0 , 2x-^+X2~5>_0, 3x^“ 2 x 2 “ 1^.0 , x 2 ^.l ) 

R 12 

= (x: 

2 x 2 +X 2 ~ 10 £ 0 , 2 x^+x 2"5>_0, 

3 x ^- 2 x 2 ~ 1^0 , - 2 <^x 2 f.l} 

R 13 

= (x: 

2x2+3x2~65_0 , 2x2+X2“5^_0, 

3x^- 2 x 2 ~ 1>0 , X 2 ^_- 2 } 

R 14 

= { x : 

6x2+X2“14^0, 2x2+X2“5^_0, 

3x^-2x 2 -ll.O , 

R 15 

= { x : 

2x2+3x2~6^_0, 2x2+X2-5^0, 

3x 1 ~2x2-ll.O , x 2 ^l) 

R 16 

=* Cx: 

I 8 X 2 +X 2-22 >_0 , 2x2+x 2 “5<_0 i 

3x^-2x2“1^.0 , X 2 > 1 ) 

R 17 

= (x: 

6x2+3x2~10^_0 , 2 x2+x 2 “5£0 j 

3x^ - 2 x 2 - 1^.0 , _ 2 <^X 2 ^_ 1 ) 

CO 

cH 

= { x : 

6x2+X2-14^0, 2x2+X2~5£0, 

3x 1 -2x 2 ~1^.0 , X 2 f.- 2 ) 

Infact, we can 
/ 

4 

write where 


A 1 “ 

V 

R 16 

• 

s= 

(x: 

6x^+11x 2 ~38<_0 , 3x2-2x2~1^0 

, 18x^+X2"22>_0 , X 2 ^ 1 ) 

a 2 “ 

R 14 U 

R 15 


= 

{x: 

6x 2+X2~14<_0 , 3x2-2x2 _ lf.O , 

2x-^+3x2 _ 6^_0 , X 2 ^ 1 } 

A 3 " 

r 12 0 

R 17 


’ = 

(x: 

2 k 1 +X 2~ 10 — 0 ' 3 x 1 “ 2 x 2 ~ 1 1 0 ' 

6x 1 +3x 2 -10>0 , - 2 <_x 2 ^_ 1 } 

A 4 " 

R 13 u 

R 18 


= 

{x: 

2x^+3x2*'6<_0 , 3x^-2x2~1^.0 , 

6x^+x 2 “14>0, X 2 ^_- 2 } 
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Though each of the regions R^ and consists of more than one 
subregion , these are piecewise linear and probabilities of 
classification P(i|i) and P(i]j), i and j = 1,2, can be easily 
evaluated. 

4.2. Populations with density functions p ^ (x) , i = l,2,...,m. 

For given a priori probabilities ' * ' * Ba ^ es 

discriminant regions are given by 


R j = {x: K=??2 p C | ) 


max 


( B 


> (- 


3 


l/(P+2) 

pWTJT’ log i=l,2,. 


K=l,2,...,p 1 jK 


,m; i^j), (4.6) 


3 1 , 2 , o . . ,m. 


If < 3p“ c 32 =::: ** ( 4 . 6 ) is reduced to 

R j = {X: K=l,2,. . . f p*l B iK* X ~ , V' ^-K=l,2,. . . ,p^ B jK (x_M j ) J } 


l — 1 , 2 , ... ,m; 17^3 ^ , 


(4.7) 


3 — 1,2, . . . , m . 

* Again these discriminant regions are described by piecewise 
linear functions and once determined these lead to an exact evalua 
tion of probabilities of classification P ( i 1 i ) and P(j|i), i and 
3~1 , 2 , • • • / in • 

Example 4.3 . Consider the populations in Example 4.2 with the 
density functions 

p(“) ( x ) _ _1_ e "l/3 ^/2 max (3 J 2x 1 +x 2 -5 j , 2|x 2 ~lj) 


and 
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(») . . 1 -1/6 4 /2‘ max(2]3x 1 -2x -l| , 3|x 0 +2|)' 

P 2 . (x) = S 72" e 1 2 2 

For qj_ =( l 2 ' t ^ ie Ba Y es discriminant regions are 

Rj, = {x: max (2 j 3x^-2x 2 ~l j , 3 | x^+ 2 [ ) >max ( 6 J 2x^+x 2 ~5 ] , 4 | x^-l | j ) 


and R 2 is the complement of R^. After the possibilities for the 
inequality are considered we obtain 





where 


A^ = (x: 2 | 3x- L ~2x 2 ~l | >6 | 2x^+x 2 ~5 j , 2 | 3x^-2x 2 ~l | ^3 i x 2 +2 ] , 

6 |2x^+x 2 ~5 | >_4 | x 2 ~l | } 

A 2 = {x: 2 | 3x^-2x 2 -l | >_4 | x 2 ~l | , 2 ] 3x^-2x 2 ~l | >_3 | x 2 +2 ] , 

6 | 2x-^+x 2 -5 I £4 | x 2 ~l | } 

A^ = (x: 3 I x 2 +2 ] >_6 | 2x^+x 2 ~5 | , 2 | 3x^-2x 2 ~l ] <_3 | x 2 +2 | , 

6 | 2x 1 +x 2 ~5 j >_4 | x 2 ~l | } 

A^ = {x: 3 ]x 2 +2 | >_4 j x 2 ~l | , 2 | 3x^-2x 2 ~l j <3 j x 2 +2 | , 

6 1 2x^+x 2 ~5 j <_4 ] x 2 ~l J } 

By solving the inequalities in and A- 4 for different 

possible cases, we obtain R^ and R 2 as shown in figure 3.3. 

We omit the evaluation of probabilities of classification 
and let the interested readers carry out these computations. 
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Figure 3.3: Bayes discriminant regions R. and R_ for popula- 
tions and tt 2 « 1 . 
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5 . ' Computational Aspects 

Apart from a theoretical solution that the problem of classi- 
fication has, the number of computations involved in the algorithm 

are also reduced when the density functions pj^(x) and p|°°^(x) are 
• ( 2 ) 

considered instead of pj^ (x) , i = l,2,...,m, as population models. 
Its explanation derives from the elimination process for the quad- 
ratic form 



tx - y i ) T zT 1 (x - y ± ) 


( 2 ) 

that needs to be computed when finding an estimate of p^ ( x) , 
i = 1,2,..., m. To give an idea of computations associated with the 
evaluation of Q 2 , consider the following examples. 



T 

3, then a quadratic form Q = X AX can be 
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where 

b. s= x, a, . + x„a~. + x-.a 0 .. 

1 • 1 lx 2 2x 3 3x 

T 

In the first multiplication of X A reguires 3 multiplications and 

3-1 additions performed 3 times to obtain the vector b = 

[b^,b 2 >b^]. The second multiplication bX requires 3 multiplications 

and 3 - 1 additions. , Hence the total number of multiplications is 
o 

3*3+3 =3+3 and the total number of additions are 3(3 - 1) + 

(3-1) =(3+1) (3-1) = 3 2 - 1. Inductively one can deduce the 
2 2 

formulas, p + p multiplications and p - 1 additions are required 

I 

to compute a value for the quadratic Q. 

Example 5.2 . Let p = 3, A = {a..} be symmetric then 


Q = [X x x 2 x 3 ] 

a ll 0 

0 


X 1 



2a 21 2a 31 

0 


X 2 


• 

2a 31 2a 32 

a 33 


X 3 







k 

= t x ^ a ^l + 2 x 2' 

a 21 + 2 X 3 a 31 ' 

x 2 a 22 

+ 2x 3 a 32 , x 3 a 33 ] 1 



= 4*1 + h 2 x 2 + b 3 X 3 

where 


b - = 


X l a ll + 2x 2 a 21 + 2x 3 a 31 
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b 2 “ X 2 a 22 + 2x 3 a 32 
b 3 • = X 3 a 33 * 

In the first multiplication of X [ requires 3+(3-l)+(3~2) 

=3+2+1 multiplications and (3-1) +(3-2) additions to 

obtain the vector b' = [b^, b b^]. The second multiplication 

b’X requires 3 multiplications and 3-1 additions. Hence, the total 

number of multiplications is 3+ (3- 1) + (3- 2) +3; and the total 

number of additions is (3 - 1) + (3 - 2) + (3 - 1). Inductively 

one can deduce the formulas p + (p - 1) . . .+ 1 + p = (p(p+l))/2 + p 
2 

= (p +3p)/2 multiplications and (p - 1) + (p - 2) + ... + 1 + (p - 1) 
« (p - l)/2 + p - 1 = (p 2 + p - 2)/2 additions. Note that if one 
takes advantage of the symmetry property of the matrix then the 
savings A^ and A a in the number of multiplications and additions 
are given by 

A m = (p 2 - p)/2 

a a = (p2 " p)/2 * 

Also, note that when = (x - y^) then one must add an additional 

p additions; hence in order to compute in which symmetry is 

2 

exploited the number of multiplications remains (p + 3p)/2, but 

the number of additions is increased by p additional additions 

2 2 

making a total of p + (p + p - 2)/2 = (p + 3p - 2)/2 additions. 

Recall that the value of m is the number of populations from 
one of which the individual generating the measurement X might come. 
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Hence to perform the computation to accomplish a single discriminant 

2 2 
task will require (p + 3p)/2 multiplications and (p + 3p - 2)/2 

additions for each Q. , i = 1,2,. ..,m, when A is symmetric. The 

decision to classify follows after ordering m positive numbers, 

p^(x), i = 1,2,... ,m. Then 

t Q (p,m) = m 

2 

t M (p,m) = m(p + 3p)/2 

and 

t A (p,m) = m(p 2 + 3p - 2)/2 

where t Q (p,m), t M (p,m), and t^(p, m) ; the number of orderings, 
multiplications, and additions (per resolution element), respective- 
ly. Since these operations must be repeated a very large number of 
times they should be performed using the computer assembly language 
upon that computer being used instead of a general language sued as 
FORTRAN, etc. These values for known applications are not extra- 
ordinarily large, yet when put into a remote sensing application 
total time becomes significantly large. 

If we denote by Tq , T^, and T^ the total number of orderings, 
multiplications and additions, respectively, per image, then 

T 0 (p,m,r,c) = mrc 

T M (p,m,r,c) = mrc(p 2 + 3p)/2 
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and 



T A (p,ra,r,c) 


= mrc (p + 3p - 2)/2 


To illustrate what size of values T A , T.. , and T* can take, 

0 M A 

3 

consider the following "real" data where r = 10 , c = 200 , p = 5, 
and m = 10 , then 

T Q = 2 • 10 6 

T M = 4 • 10 7 

M 

T a = 3.8 • 10 7 


Clearly these are large numbers of operations to be performed, and 
when one realizes that if this image represents only approximately 
a 2 mile by 10 mile strip of the earth's surface, and that it is 
proposed by space scientists that complete earth surveys be per- 
formed by remote sensing techniques, the size of the computation 
task indeed is large. 

Clearly, it becomes important to investigate schemes which 
will reduce significantly the size of the remote sensing problem. 
Since r, c and m are not in the strictest sense arbitrary, there 
appears only one parameter, the value of p, which might be reduced. 
Through techniques call characteristic selection [11], [12], [13] 

and data compression [14] one can reduce the value of p and hope- 
fully maintain approximately the optimality of the Bayes Procedure 
for discriminating. A second technique developed heuristically [15] 
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by computer scientists have proved successful in several empirical 
cases and can be considered a close approximation to a Bayes or 
optimal procedui'e . This technique is one which has "traded off" 
floating point addition and multiplication for an integer addition, 
in a table look up computer operation, thereby reducing the time for 
computing from 2 units to .066 units in one empirical example [14]. 
This technique has come to be known as the table look-up discrimi- 
nant technique. ' • 

Further savings in computing operations can be achieved by using 
the new models proposed in this paper. In the case of pj^(x) in 
(4.2), one needs to evaluate the linear form 

Q i = l,B i (x - M'i 

2 

It can be seen that this requires p(p + l)/2 = (p + p)/2 multipli- 
cations and p+p(p- l)/2 + (p - 1) = (p 2 + 3p- 2)/2 additions. 

When compared to computations involved for Q^, there is a saving of 
p multiplications but no saving in the number of additions. In 
most cases any such saving may not be of significance. But we 
should check to see if in the remote sensing application the 
saving will be significant. If we denote T^, T^, and T^ the total 
number of orderings, multiplications and additions, respectively, 
per image, when (4.2) is used, then 

T^(p,m,r ,c) = mrc 

2 

T^(p,m,r ,c) = mrc (p + p)/2 



- 68 - 




T^(p,m,r ,c) 

= mrc 

[ ( p 2 + 3p 

- 2)/2] 
i 


In our exampl 

e where 

r = 10 3 , c = 

200, p 

= 5 , and m = 10 , 

the 

numbers are 







T » 
0 

= 2 . 

10 6 

i 

= 2 . 

t. 

10 6 = 

T o 

% 

= 30 

.* 10 6 *= 3.0 

• 10 7 

< 4 • 

10 7 = 

t h 

T k 

= 38 

• 10 6 = 3.8 

• 10 7 

= 3.8 • 

10 7 = 

t a 


Note that there exists a savings of 3 to 4 in multiplications by 
using the probability density function as defined by (4.2) instead 
of the normal probability density functions in this example. 

Next, for the density function p^ ' ( x) in (4.3), an evaluation 
of 

Q m = llB.(x - y.)l! ’ 

a> X X 00 

is needed. What this will do compared to is that (p-1) additions 
will be eliminated at the cost of an ordering operation to determine 

max { |B ik ( x - y ± ) | } 

; i=l , 2. . . ,p 

and will not effect any change in the number of multiplications. 

If Tq, T^, and denote the total number of orderings, multiplica- 
tions and additions, respectively, per image, when (4.3) is used. 


then 
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T" (p,m,r,c) = 

(p,m,r,c) = 

(p,m,r,c) = 

For the numerical example where r = 
we have 


■-3 
© = 

ll 

r-'- 

o 

i — i 


m ll 

M 

3.0 

• 10 

m ll 

A 

3.0 

. 10 


mrcp 

2 

mrc(p + p)/2 
2 

mrc(p + p)/2 

3 

10 , c = 200, p = 5, and m = 10, 

> 2 • 10 6 = T 0 

< 4 • 10 7 = T 

M 

< 3.8 o 10 7 = T 

A 


This leads to a saving of approximately 3 to 4 in total number of 
computing operations, but has increased the number of orderings by 
a multiple of 5. 

It may be observed from these examples that the non-zero co- 
variances in the covariance matrices Y. , i = l,2,...,m, are the 
sources of our computational problems. It certainly would be de- 
sirable to select those characteristics which will be uncorrelated 
and yet discriminate well. 


6 . Concluding Remarks 

There are several facets of our discussion in this paper. 

First it may be noted that we have considered transformed random 
variables obtained by linear combinations of components of a random 
vector. The linear combinations depend upon the covariance matrix 


i 
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of the random vector and are therefore not arbitrary. It is some- 
time desirable to have the data transformed in some suitable way 
so that a consequent analysis becomes more relevant and useful. 

For example, the technique of transforming variables is well exploit- 
ed in regression analysis for making regression more nearly linear 
and, possibly, random variables distributed more nearly normal. 

Next, in his discussion on the problem of principal components , 
Anderson [1, chapter 11] has cited many advantages that the linear 
transformation of random variables has. It is' therefore our hope 
that this paper furthers such a cause and that the consideration 
of the proposed normed exponential density functions lea&s to some 
kind of break in the "stalemate" which the Bayes classification 
problem has reached in the case of normal density functions with 
unequal covariance matrices regarding the evaluation of probabilities 
of classification. 

Though we have discussed only the problem of discriminant ana- 
lysis with respect to the normed exponential density functions as 
models, the ideei is sufficiently general and perhaps other problems 
of the multivariate analysis theory can be treated and solved by 
using these density functions. 
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ABSTRACT 


This report discusses dynamic programming and cluster analysis. 
A dynamic programming technique which yields an optimal partition 
is motivated and discussed and its relevance to data of the magni- 
tude of remote sensing data is noted. 
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1. Introduction . 

The utilization of discriminant analysis in the remote sensing 
application has been widely discussed; for example see [6] and [7]. 
The more basic subject- of cluster analysis has also been discussed 
in relation to remote sensing data [2], [4], [5]. Cluster analysis 
is more basic in that the number of classes (populations) is not 
assumed known but is determined, in general, as part of the solu- 
tion. 

A particular cluster analysis technique used in the remote 
sensing data situation is Ball and Hall’s [1] well known ISODATA 
technique. This procedure is well documented and its use, in- 
cluding various modifications of the original procedure, are dis- 
cussed in [5] . The ISODATA procedure is an iterative procedure 
which has received very wide acceptance. 

The cluster problem can be viewed as that of partitioning 
objects into m subsets such that objects within each subset are 


This research was supported (in part) by NASA Manned Spacecraft 
Center under Contract NAS9-12775. 

This research was supported (in part) by NASA Manned Spacecraft 
Center under Contract NAS9-11925. 



-76- 


"similar" and subjects between subsets are "dissimilar". The 
objective in solving the cluster problem is then to determine the 
optimal partitioning such that a certain criterion of homogeneity 
within clusters is satisfied. One way of accomplishing this 
objective is by complete enumeration, i.e. examine the homogeneity 
criterion for all possible partitions into m clusters and choose 
that one which is optimal. Unfortunately, the method of complete 
enumeration is in general impractical, even for small values of 
n and m. 

One- alternative to the complete enumeration technique is to 
utilize some of the techniques popularly called dynamic program- 
ming techniques in an attempt to reduce the amount of computation 
but yet converge on the optimal solution. Many techniques, such 
as hierarchichal techniques, search for the optimal solution in 
a class of subsets (clusters) and the optimal solution over the 
whole class of clusters is not guaranteed. The aim here is to 

i 

motivate and discuss a dynamic programming scheme of Jensen [3]. 

2 • Application of Dynamic Programming to the Cluster Problem . 

In this section we consider the problem of partitioning 
a set of 6 objects into 3 subsets when the distance between two 
objects is the Eulidean metric or the criterion is the minimiza- 
tion of Within Groups Sum of Squares (WGSS) . 

Recall that WGSS is given by • 

m m 

W = tr l s = l W. 

3=1 3 3=1 3 
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where S . denotes the n x n scatter matrix for the j cluster 
1 J 

and tr = W_. . Equivalently, we have 



where d 2 (X . ,X . ) = (X.-X.) T (X.-X.). 

11 i 1 il 

The purpose of a dynamic programming scheme is to systema- 
tically search for groupings which yield minimum values of the 
quantity W, eliminating those groupings which do not yield mini- 
mum values of W and also those that are redundant. 

We now discuss. the problem of partitioning n = 6 objects 
into m = 3' subsets by complete enumeration. This will serve to 
motivate a dynamic programming scheme for the cluster problem 
given by Jensen [ 3] . 

The total number of ways of partitioning 6 objects into 3 
subsets is given by . 



The 90 clustering alternatives can be classified according 
to their distribution forms [ 3] . The three distribution forms 
in this case are denoted by 
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(i) 

• { 4 } 

{ 1 } 

£ 1) / 

(ii) 

{3} 

{2} 

U), 

(iii) 

{2} 

{2} 

{2} , 


v;here each of the components in a distribution form {i} denotes 
the number, i, of objects in the corresponding cluster. The com- 
ponents of a distribution form will always be written in descending 
order. In our example there are 90 clustering alternatives but 
only 3 distribution forms. In general the number of distribution 
forms is substantially smaller than the number of clustering 
alternatives . 

<*><!> 

There are = 15 clustering alternatives 

2 

6 3 

corresponding to the distribution form {4}, { 1 } , {1}; ( 3 )^) = 60 

clustering alternatives corresponding to {3}, {2}, {1}; and 

< 2 ) ( 3 ) "( 2 ) /3 i = 15 clustering alternatives corresponding to {2}, 

{'2), {2}. The clustering alternatives corresponding to each 


distribution form are, now 

listed 


Distribution Form 

{4} 

, (I), 

{1} 

H 

** 

to 

«•* 

3 , 

4) , 

(5) , 

( 6 ) 

(1, 2, 

3, 

5) , 

(4) , 

( 6 ) 

(1/ 2, 

5, 

4), 

(3) , 

( 6 ) 

(1, 5, 

3, 

4 ) / 

( 2 ) , 

( 6 ) 

(5/ 2, 

3, 

4) , 

( 1 ) , 

( 6 ) 

dr 2, 

3, 

6 ) , 

(5) , 

(4) 
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dr 

2, 

6, 

4), (5), 

(3) 

dr 

6, 

3, 

4), (5 ) , 

(2) 

(6, 

2, 

3, 

4) , (5) , 

d) 

(lr 

2, 

5, 

6), (3), 

(4) 

dr 

5, 

6 , 

4), (2 ) , 

(3) 

(5, 

6 , 

3, 

4), d)r 

(2) 

(5r 

2, 

3 r 

r-f 

** 

(4) 

dr 

5, 

3, 

6), (2), 

(4) 

( 5 r 

2, 

6, 

4), ( 1 ) r 

" (D 


Distributicn Form {3}, { 2 }, { 1 } : 


dr 

2, 

3 ) r 

(4, 5), 

(6) 

dr 

4 r 

5) r 

(2, 3), 

(6) 

(-D 

2 r 

3) r 

( 4 r 6), 

(5) 

dr 

4 r 

5) r 

(2, 6) , 

(3) 

dr 

2, 

3), 

(5, 6), 

(4) 

dr 

4 r 

5) , 

(3, 6) , 

(2) 

dr 

2, 

4 ) r 

(3, 5), 

(6) 

dr- 

4, 

6) , 

(2, 3) , 

(5) 

(lr 

2, 

4) r 

(3, 6), 

(5) 

dr 

4, 

6) , 

(2, 5) , 

(3) 

dr 

2, 

4) , 

( 5 r 6) , 

(3) 

(lr 

4, 

6) r 

(3, 5) , 

(2) 

dr 

2 r 

5) , 

(4, 3), 

(6) 

(1, 

5, 

6) , 

(2, 3) , 

(4) 

dr 

2, 

5) r 

(4 r 6) r 

(3) 

d, 

5 r 

6) , 

(2, 4), 

(3) 

dr 

2 r 

5) r 

(6, 3) r 

(4) 

(lr 

'5, 

6) r 

( 3 r 4), 

(2) 

dr 

2 r 

6) r 

(4, 5), 

(3) 

(2, 

4 r 

5) r 

dr 3), 

(6) 

dr 

2, 

6) r 

( 3 / 5) , 

(4) 

(2, 

4 r 

5) r 

dr 6) , 

(3) 

(lr 

2 r 

6) r 

( 3 r 4), 

(5) 

(2, 

4, 

5) r 

(3, 6), 

(1) 

dr 

4 r 

3) r 

( 2 r 5)r 

(6) 

( 2 r 

4, 

6) r 

dr 3) , 

(5) 

dr 

4 f 

3) , 

(2, 6), 

(5) 

• ( 2 r 

4 r 

6) r 

dr 5) , 

(3) 

dr 

4, 

3) r 

(5, 6) # 

(2) 

(2 » 

4, 

6) r 

(3, 5), 

d) 

(lr 

5, 

3) , 

(4, 2), 

(6) 

(2, 

5r 

6) r 

dr 3), 

(4) 
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(1, 

5, 

3) , 

(4 / 

6) , 

(2) 

(2, 

5, 

6) , 

dr 4), 

(3) 

(lr 

5, 

3) , 

(2 r 

6) , 

(4) 

(2, 

5r 

6) , 

(3 / 4), 

d) 

dr 

6, 

3) , 

(4, 

5) , 

(2) 

( 3 / 

4 r 

5) , 

dr 2), 

(6) 

dr 

6r 

3) , 

( 4 , 

2) , 

(5) 

' (3, 

4 r 

5) , 

dr 6), 

(2) 

dr 

6, 

3) , 

(2, 

5) , 

(4) 

(3, 

4, 

5) , 

(2, 6), 

d) 

(4, 

2r 

3) , 

d, 

5) , 

(6) 

(3 / 

4, 

6) , 

d, 2), 

(5) 

(4, 

2, 

3) , 

dr 

6) , 

(5) 

(3, 

4, 

6) r 

d, 5), 

(2) 

(4 / 

2, 

3) , 

( 5 / 

6) , 

d) 

(3, 

4, 

6) r 

(2 , 5) , 

d) 

(5, 

2, 

3) , 

( 4 / 

1) r 

(6) 

(3, 

5r 

6)r 

dr 2), 

(4) 

(5, 

2, 

3) , 

(4, 

6) , 

(1) 

(3/ 

5, 

6) , 

d, 4), 

(2) 

(5, 

2, 

3) , 

dr 

6) , 

(4) 

(3 / 

5, 

6) , 

(2, 4) , 

d) 

(6, 

2 r 

3) , 

(4 / 

5) r 

d) 

(4, 

5, 

.6) , 

dr 2), 

(3) 

(6, 

2 r 

3) , 

(4 / 

1) , 

(5) 

(4 , 

5r 

6) r 

dr 3) , 

(2) 

(6, 

2, 

3 ) r 

dr 

5) r 

(4) 

(4 / 

5r 

6) r 

(2, 3) , 

d) 


Distribution Form {2}, { 2 }, 

{2}: 





d, 

2) , 

(3 r 

4) r 

(5, 

6) 






dr 

2), 

(3, 

5) , 

(4, 

6) 






dr 

2) , 

(3, 

6) r 

(4, 

5) 






dr 

3), 

(2, 

4 ) r 

(5, 

6) 






dr 

3) , 

(2, 

5) r 

(4, 

6) 






dr 

3) r 

(2, 

6), 

(4, 

5) 






(lr 

4) r 

(2, 

3 ) r 

(5, 

6) 

• 





dr 

4) , 

(2, 

5)r 

(3, 

6) 






dr 

4) , 

(2, 

6),, 

(3, 

5) 






dr 

5) , 

(3, 

4) , 

( 2 / 

6) 
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(1, 

5) , 

(3, 

2) , 

(4, 

6) 

(1, 

5) , 

(3, 

6) , 

(2, 

4) 

(1, 

6) , 

(3, 

4) , 

(3, 

2) 

(1, 

6) , 

(3, 

5) , 

(4, 

2) 

(1, 

6) , 

(3, 

2) , 

(4, 

5) 


under complete enumeration the objective function W(WGSS) 

I 

would need to be evaluated for each of the 90 clustering alterna- 
tives given above and that clustering alternative chosen for 
which W is a minimum. One notes from the list of clustering al- 
ternatives that under complete enumeration the WGSS would be 
calculated more than once for some of the clusters, for example, 
the cluster (1, 2, 3). 

A dynamic programming scheme applied to the cluster problem 
is a scheme which works for the optimum grouping in stages such 
that at each stage the objective function is computed in such a 
way that redundant calculations inherent in the complete enumera- 
tion procedure are eliminated. In this way, the optimal solution 
will be attained in stages. The dynamic programming approach will 
require large amounts of rapid access storage. 

The above example can be put into the framework of a dynamic 
programming solution as follows. The clustering alternatives 
are first classified according to their distribution forms. Re- 
call that the distribution form components are listed in descending 
order. At the first stage the objective function for each cluster 
corresponding to the first distribution form component is evalua- 
ted and saved. At the second stage the objective function for 
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the clusters corresponding to the first two components of the dis- 
tribution forms is evaluated using all information from the first 
stage, that is, the within sum of squares is not recomputed for 
any cluster but "carried over" from the first stage. 

For a discussion of the dynamic programming approach consider 
Table 3.1. The second column gives the clusters corresponding 
to the first component of the distribution forms, that is, the 
clusters available for the first stage. The number of clusters 
for the first stage is (^j + j^J + ^ = 50. The function W 
will be computed for each of the fifty clusters in stage 1. At 
the second- stage we will have 2 clusters corresponding to the 
first two components of the distribution forms, that is, we will 
have clusters of size {4} and {1}, {3} and {2}, or {2} and {2}. 

Thus the total number of objects at stage 2 will be 5 or 4. The 
number of ways of obtaining 5 objects is given by + ^ 3 ^ ( 2 ) = 

The number of ways of obtaining 4 objects is (^J (^\ + = 150. 

The total number of ways of obtaining objects in stage 2 is there- 
fore 240. However, there are ~ (^j (^j = 45 ways to form clusters 
giving rise to distribution form components { 2 } { 2 } at stage 2 , 
that is, there are 45 redundancies. Also, for components {3} {1} 
at the second stage it will be necessary to add 2 entities at 
the third stage giving rise to distribution form {3} {1} {2} which 
is ultimately equivalent to form {3} {2} {1}. Thus, the total 
number of ways of obtaining entities for the second stage is 





= 30 + 60 + 45 = 135 


90 . 
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Stage 0 


1 . ( ) 


TABLE 3.1 


St acre 1 


Stage 2 


Stage 3 


1 . 

( 1 , 

2, 

3, 

4) 







2. 

( 1 , 

2, 

3," 

5) 







3 . 

( 1 , 

2, 

5, 

4) 







4. 

(1, 

5, 

3, 

4) 







5. 

(5, 

2 ; 

3, 

4) 







6. 

( 1 , 

2, 

3, 

6) 







7. 

( 1 / 

2, 

6, 

4) 







8. 

( 1 , 

6 , 

3, 

4) 







9. 

(6, 

2, 

3, 

4) 







10. 

(1, 

2, 

5 , 

6) 







11. 

( 1 , 

5, 

6, 

4) 







12. 

(5, 

6 , 

3, 

4) 







13. 

(5, 

2, 

3, 

4) 







14. 

( 1 , 

5, 

3, 

6) 







15.. 

(5, 

2, 

6 , 

4) 

1. 

(1, 

2, 

3, 

4, 

5) 

16. 

( 1 , 

2, 

3) 


2. 

( 1 , 

2, 

3, 

4, 

6) 

17. 

( 1 , 

2, 

4) 


3. 

(1, 

2, 

3, 

5, 

6) 

18. 

( 1 , 

2, 

5) 


4. 

( 1 , 

2, 

4, 

5, 

6) 

19. 

( 1 , 

2, 

6) 


5. 

( 1 , 

2, 

3, 

5, 

6) 

20. 

( 1 , 

3, 

4) 


6. 

(2, 

3, 

4, 

5, 

6) 

21. 

(1, 

3, 

5) 


7. 

(1, 

2, 

3 / 

4) 


22. 

(1, 

3, 

6) 


8. 

(1, 

2, 

3, 

5) 


23. 

(1, 

4, 

5) 


9. 

(1, 

2, 

3, 

6) 


24. 

(1, 

4, 

6) 


10. 

(1, 

2, 

4, 

5) 


25. 

(1, 

5, 

6) 


11. 

(1, 

2, 

4 / 

6) 


3*. 

(2, 

3, 

4) 


12. 

(1, 

2, 

5, 

6) 


27. 

(2, 

3, 

5) 


13. 

(1, 

3, 

4,. 

5) 


28. 

(2, 

3, 

6) 


14. 

(1, 

3, 

4, 

6) 


29. 

(2, 

4, 

5) 


15. 

U,. 

3, 

5, 

6) 


30. 

(2, 

4, 

6) 


16. 

(1, 

4, 

5, 

6) 


31. 

(2, 

5, 

6) 


17. 

(2, 

3, 

4, 

5) 


32. 

(3, 

4, 

5) 


18. 

(2, 

3, 

4, 

6) 


33. 

(3, 

4, 

6) 


19. 

(2, 

3, 

5, 

6) 


34. 

(3, 

5, 

6) 


20. 

(2, 

4, 

5, 

6) 


35. 

(4, 

5, 

6) 


21. 

(3, 

4, 

5, 

6) 


36. 

(1, 

2) 









37. 

(1, 

3) 









38. 

(1, 

4) 









39. 

(1/ 

5) 









40. 

(1/ 

6) 









41. 

(2, 

3) 


1 







42. 

(2, 

4) 









43, 

(2, 

5) 









44. 

(2, 

6) 




• 





45. 

(3, 

4) 









46. 

(3, 

5) 









47. 

(3, 

6) 









48. 

(4, 

5) 









49. 

(4, 

6) 









50. 

(5, 

6) 










1 . ( 1 , 2 , 3 , 4 , 5 , 6 ) 
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a reduction of 105. 

The number of distinct sets containing either 4 or 5 objects 
for stage 2 is ^ = 21. These are listed under stage 2 

in the table 3.1 and are called states. Thus there are 21 states 
in stage 2. There were 50 states in stage 1. Five of the 135 
feasible ways of, obtaining states in stage 2 are indicated in 
Table 3.1. 

The final stage of the process is stage 3. The firial stage 
will result in 3 clusters. There is only one state in the final 
stage, the one involving all six objects. The number of ways 
of arriving at the six objects in the final state is 




=6+15 =21, 


that is, there are 21 feasible arcs from stage 2 to stage 3. 

For the example with n = 6 and m = 3 there are a total of 
135 + 21 = 156 feasible arcs. If one includes the number of 
initial states then there are 156 + 50 = 206 feasible arcs. Each 
feasible arc results in what is called a transitional calculation 
defined by 


( 2 ) 


" n 


l 

k i<jeg k 


a 2 . 

13 


where g k denotes a group of n^ objects and d^ the distance be- 
tween X . and X . . 

i 3 

The total enumeration procedure involves 90 clustering al- 
ternatives and 3 transitional calculations for each alternative 
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resulting in a total of 270 transitional calculations. The 
dynamic programming approach involves 206 or 64 fewer transi- 
tional calculations. 

Under dynamic programming suppose there exists a state, at 
some stage k, 'containing objects X^,...,X , q £ n. The dynamic 
programming procedure stores in memory the optimal way to parti- 
tion the q objects in k nonempty and mutually exclusive subsets. 
In later stages in which the q objects are partitioned into k 
subsets it is not necessary to recompute all feasible ways of 
performing the partitioning. 

As an illustration consider our example with n = 6, m = 3. 

Table 3.2 


Alternative 

1 

2 

3 


Transitional Calculations 
T (1 , 2) + T(3,- 4) + T(5, 6) 

T (1 , 3) + T ( 2 , 4) + T (5 , 6) 

T (1 , 4) + T(2, 3) + T (5 , 6) 


Recall that when n = 6 and m = 3 there arc S(6, 3 ) =90 unique 
clustering alternatives available. Three of these are listed in 
Table 3.2. Under complete enumeration 9 transitional computa- 
tions would be required for these 3 alternatives. Under dynamic 
programming it would take 6 transitional computations for the 
optimal partition of (1, 2, 3, 4) into two groups of size 2. The 
optimal partition, say T(l, 3) + T(2, 4) = V^d, 2, 3, 4) is 
recorded in memory so that only one additional computation is 
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required to determine ( 1 , 2, 3, 4) + T(5, 6). For these 3 
alternatives dynamic programming has eliminated 9-7=2 re- 
dundant calculations . Actually, as n and m are increased the 
number of redundant arcs that are eliminated is substantial, 
however, relative to the total number of transitional calculations 
the difference may not be so great. 

3. Jensen's Dynamic Programming Model . 

There is no standard mathematical formulation for the dynamic 
programming problem. This is in contrast to the linear program- 
ming problem for which there does exist a precise standard for- 
mulation. The equations and formulas pertinent to a dynamic pro- 
gramming problem depend on the particular situation at hand. The 
problem is usually reduced to a recursive relationship or equation 
which reflects the multiple interrelated decisions inherent in 
the dynamic programming procedure and which result in the final 
"optimal" result. 

Jensen's dynamic programming formulation is given in terms 
of the recursive equation 

0 if k = 0 , 

(3) W k (z) ;: = < 

min [T(z-y) + W k _ 1 (y)] / if k = 1, 2,...,m 0 . 

y u 

i 

where 

m = number of disjoint and non-empty subsets into 
which the n objects are to be partitioned. 
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k = index or stage variable, 

m Q = m if n >_ m, and n - m if n < m, 

z = state variable representing a given set of 
objects at stage k, 

y = state variable representing a given set of 
objects at stage k-1, 

z - y = subset of all objects contained in z but 
not in y, 

T(x-y) = is the "transition cost" of the objects in the 
cluster of objects in (z-y) . 

The variables y and z represent 2 states (sets of objects) in 
stages k-1 and k, respectively. The difference z-y represents 
those objects contained in the stage k state but not in the stage 
k-1 state. T(z-y) then represents the ''transition cost" or WGSS 
for those objects which are combined with the stage k-1 state 
objects and W^(z) = min (T(z-y) + (y) ] gives the minimum value 

for WGSS in partitioning the objects represented by z into k dis- 
joint and nonempty subsets. It will be seen that the use of formula 
(3) calls for a substantial amount of bookkeeping. Recall from 
section 2, e.g. (2) that if g^ denotes a cluster of n^ objects 
then the transition cost T(g^) is given by 


T ig L ) = 


- I 

i k< jeg ± 


*j 
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which is actually the WGSS for cluster g^. 

Note that the number of stages is = m. if n > m, and 
n - m if n < 2m. The reason for this is that if n < 2m there must 
always be at least n - m + 1 single-object clusters. The transi- 
tion cost T for a single-object cluster is 0 so that single object 
clusters add nothing to W. Consequently, the process may be ter- 
minated at stage m^ and all remaining clusters are assumed to be 
single-object clusters. Also, in computing W^(z) it shduld be 
emphasized that the objects corresponding to any state in stage k 
consist of objects contained in some set corresponding to some 
state y of stage k - 1 and objects contained in another set re- 
presented by z - y. 

As an example to illustrate the notions involved in the re- 
cursive equation (3) consider state 37 of stage 1 and state 15 of 
stage 2 when n = 6 and m = 3 (Table 3.1). In this- case y represents 
the objects (1, 3) in stage 37 of stage 1, z represents the objects 
(1, 3, 5, 6) in state 15 of stage 2, and z - y represents the 
objects (5, 6). The "transition cost" from stage 37 to state 15 
is then 

T(z - y) = T ( 5 , 6) = d^ 6 

The transition cost from state 37 in stage 1 to state 1 in stage 
2 would be 

,2 ,2 ^ ,2 

d 24 + d 25 + d 45 


T (z - y) = T (2 , 4, 5) = 


3 
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At the first stage the dynamic programming algorithm 
considers the evaluation of W^(z) for a given, set of clusters. 
In this case 


W^z) = min [T(z-y) + W Q (y) ] = T(z), 

y 

where z represents a given set of objects. The quantity W^(z) 
is computed for each of the possible clusters at the first stage. 
The maximum number of objects available for a cluster in the 
first stage, denoted by max (1) , is given by 

max (l)En-m+l, 


that is, the largest cluster has n - m + 1 objects in which case 
the remaining clusters would be single-object clusters. The mini 
mum number of objects, denoted by min (1), in a cluster in stage 
1 is 


min (1) = n/m 


if n is an even multiple of m, and 

/ 



, min (1) 


[n/m] + 1, for 1 <_ n - m[n/m] , 

I n - (m-1) [n/m] , for n - m[n/m] < 1 < m. 
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when n is not an even multiple of m, where [n/m] denotes the 
largest integer <_ n/m. The total number of clusters available 
for the first stage, denoted by NS(1) is given by 

max (1) 

(4) NS (1) = l (!?) 

j=m.in(l) J 


The first stage of the algorithm consists of computing the quantity 
T(z) for each of the NS(1) possible clusters. 

In general the maximum number of objects in any one state in 
stage k is equal to the maximum sum of distribution form compo- 
nents from stages 1 through k inclusive. The minimum number of 
states is the minimum sum of the distribution form components. 

For max (k) and min(k) we have 

(5) max(k) = n - m + k 
and 

(6) min(k) = k [n/m] 

if n is an even multiple of m. If n is not an even multiple 
of m we have 


(7) 


([n/m] + l)k, for 1 < k < n - m[n/m] 


mm 


in (k) = < 


n - (m - k) [n/m], for n - m[n/m] < k < m. 



The number of states available for stage k is given by 

( 1 for k = 0 


(8) NS (k) =/ 

max(k) 

l <!J) for k = 1,2,... ,m. 

i j=min (k) 3 


The 

gramming 


total number of states available in the dynamic pro- 
formulation is thus given by 


0 

(9) l NS(k) . 

k=0 

A very important quantity in the formulation is the total 
number of values for W^Cz) in going from stage k-1 to stage k, 
that is, the number of ways of forming a state in stage K. 

States in successive stages are connected by arcs. Two states, 
in stages k-1 and k, are connected by a feasible arc if the objects 
in the state in stage k consist of objects in the state in stage 
k-1. That is a feasible arc cannot exist between a state in 
stage k-1 and a state in stage k if an object contained in the 
stage k-1 state is not contained in the stage k state for 2 jc k _< m^. 
In the dynamic programming algorithm the total number of 


feasible arcs is given by 
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where TA(k) represents the total number of feasible arcs between 
stage k and stage k+1 for k = 1, 2,...,mQ. The value of TA(k) 
is given by 


(ID 


max(k) max (k+1) -min (k) 

TA(k) = l l FA(j,i) , 

j=min(k) i=l 


where 


( 12 ) 


(j) ( n ^- i. * 3 ) if min (k+1) £ i + j £ max 


(k+1) 


FA(j,i) = ( 


otherwise 


In (11) and fl2) , i denotes the number of objects among a class 


of (feasible) states at stage k. There are 


containing i objects/ since 


n 


such states 


is the number of subsets of size 


i. The quantity j denotes the number of objects to be combined 
with the i objects to form a new state at stage k+1. Obviously 
we must have min (k+1) £ i + j £ max (k+1) for a state of size 
i + j to exist at stage k+1. If i + j satisfies the required 
condition, then there are sets of size n-i that may be 

added to the i objects j at a time. 

Jensen gives a way of computing the efficiency of dynamic 
programming relative to complete enumeration. Efficiency is 
defined as the ratio of the total number of transitional calcu- 
lations under dynamic programming to the corresponding number of 
calculations under complete enumeration. Alternatively, the 
numerator can be taken to be the total number of feasible arcs. 
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In either case the dynamic programming procedure is quite efficient. 
However, the dynamic programming procedure requires more computer 
memory and consequently slow-access storage could make it less 
useful than complete enumeration. In any event for large n and 
m one might be better off using some other technique such as 
ISODATA or hierarchial procedures. 

In order to illustrate Jensen's formulation consider the 
example with n = 6 and m = 3. In this case n = 2m so we need 
to consider n-m=6-3=3 stages, i.e. mg = 3. Furthermore, 

max(l) =n-m+l=4 
min (1) = ( [6/3] ) 1 = 2 
max (2) = n- m+ 2 = 5 
min (2) = ( [6/3] }2 = 4. 
max (3) = n - m f- 3 = 6 
min (2) = ([6/3]) 3 =6 

i • 

as can be seen from Table 3.1 

The total number of states in stages 0, 1, 2, and 3 are given 

by 


NS (0) = 1 




These states are listed in Table 3.1. The total number of states 
is thus 73. This figure agrees with Table 3.1 if NS(0) = 1. 

From equation (12) we have, for k=l. 


FA(3 ' 1) = (3) (1) 

= 60 

FA ( 3 , 2 ) = («) ( 3 ) 

= 60 

».«<•»■ - .(s). $ 

= 30 

FA (2, 2) = (|) Q 

= 90 

FA (3,3) = FA (4,2) 

= 0 


and the total number of feasible arcs between stages 1 and 2 is 

TA (1) = 240. 

Similarly for k=2 we have 



Thus the total number of feasible arcs in our example is, by (10) , 


2 

TFA = NS (1) + l' TA(k) = 50 + 240 + 21 = 311. 
k=l 

From section 1 it was seen that half of the stages in stage 2 
corresponding. to distribution form components {2}, {2} are re- 
dundant and that the 60 arcs corresponding to components {3} {1} 
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ultima tely lead to the form (3) (l) {2} which is equivalent to 

(3) {2} (l). Thus in the reduced formulation the number of feasible 

arcs, denoted by NA, 

NA = 50 + 135 + 21 = 206. 


The number of feasible arcs, between stages k and k+1, elimi- 
nation is given by 

max(k) max (k+1) -min (k) 

NA(k) = l l A (i , j ) 

i=min (k) j=l 


where 



and 


min (k+1) £ i+j <_ max (k+1) 
(m-k) j + i > n 


The total number of arcs in the reduced formulation is. given by 


* *i 0 -i 

NS (1) + l NA (k) . 
k=l 


It can be verified that (13) .yields 206. 

The maximum number of feasible arcs that must be evaluated 
in the dynamic programming formulation is then 206. 

To illustrate how the dynamic programming algorithm/ operates 
let p = 2 and let the six objects be (1,1) , (3,4) , (5,5) , (4,4) , 
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(1,2) / anu (5,6) or 


(l 3 5 4 1 5\ 

U 4 5 4 2 6/ * 


The sauared distances are then 


d 12 = 13, d^ 3 = 32, d^ 4 = 18, d^ 5 =1, d 2 g = 41, 

d 23 = 5 ' d 24 " d 25 = 8 ' d 26 = 8 ' d 34 = 2 ' 





5 / 



According to the dynamic programming algorithm we would 

have: 

Stage 0 : Wq(0) =0. 


Stage 1 : Compute W^(z) = T(z-^) + Wq ( y) = T(z) + W(0) = T(z) 
for each set of objects in stage 1. For example 


W 1 (l, 2, 3, 4) = T (1 , 2, 3, 4) + W Q (0) 



. ,2 ,2 ,2 „2 ,2 

d 12 + d 13 + d 14 + d 23 + d 24 


+ d 


34 


= 17.75 


There are 50 such values for stage 1. 
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Stage 2: Compute W 0 (z) = min(T(z-y) + w., (y) } for each set of 

y 

objects in stage 2. For example 


W 2 (1, 2, 3, 4) = min (T(5) + 

T (3) + 
T(l) + 
~T ( 1 , 3 ) 
T (1 / 5) 
T(2,4) 
T (3/ 4 ) 
T(4,5) 


W x (1,2,3, 4) , 

T (4) + 

V? 1 ( 1 , 2 , 3 , 5 ) , 

W 1 (l,2,4,5) , 

T (2) + 

W 1 (l,3,4,5) , 

W 1 (2,3,4,5) , 

T (1 , 2) 

+ W 1 (l,2,3) , 

+ W x (2,4,5) , 

T (1 , 4 ) 

+ W x (2,3,5) , 

+ W 2 (2,3,4) , 

T (2 , 3) 

• . 

+ W 1 (l,4,5) , 

+ W^l.,3,5) , 

T (2 , 5) 

+ W x (1,3,4) , 

+ Wjl (1,2,5) , 

T (3 , 5) 

+ 1^(1, 4, 5) , 

+ W' x (1,2,3) } 

• 



Stage 3 : Compute W,(z) = min{T(z-y) +*W~(y)} for each set of 

y 

objects in stage 3. In this stage z represents the one set of 
objects (1,2 ,3,4 ,5,6) . There are 21 feasible arcs between states 
in stage 2 and states in stage 3. Thus, we would choose the 
minimum of 21 values. As an example one of these 21 values is 


T (2 , 4) + W 2 (l,3,5,6) 

corresponding to state number 15 (see Table 3.1) in which case 
y corresponds to the set (1,3,5, 5), z corresponds to the siet 
(1,2 , 3 , 4 , 5, 6) and z-y corresponds to the set (2,4). 
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The results of the dynamic programming procedure are the 
clusters (1,1) and (1,2); (3,4) and (4,4); and (5,5) and (5,6) 

with distribution form {2}, (2), (2). The minimum value for 
W is 

W (1,2, 3, 4, 5, 6) = 1.5. 

The results are displayed in Figure 1. 



Characteristic 1 
Figure 1. Graph of n=6 objects 


I 
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4. Concluding Remarks . 

It is apparent that the dynamic programming technique dis- 
cussed in this report will not prove useful in a remote sensing 
data situation in view of the large magnitude of such data. The 
technique discussed herein does, however, yield an optimal par- 
tition. It appears that the search for a useful dynamic program- 
ming technique will yield one which strives for a local optimum 
at each stage of the process. 
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1. Introduction 

When misclassification costs are equal and prior probabilities are 
equal, for classifying an individual I(x), observation on whose p character- 
istics is p x 1 vector x, into one of two normal populations tt, and tt 2 
with densities N p (u.j>D» (i=1 »2) , the Bayes procedure that minimizes the 
total misclassification probability partitions the p-dimensional real 
Euclidean space E p into two regions R-j and R 2 9 1ven by 

R^= ^ = {x:(m 1 -w 2 ) T Z " 1 [x-l/2(u 1 +y 2 )]> 0 . (1) 


The misclassification probabilities are known (Anderson [1] p 136) to be 
given by 


P ( 2 1 1 ) = P(XeR 2 |I(X)eir 1 ) = *(-A/2) 

P(l|2) = P(XeR 1 (I(X)e-rT 2 ) = 

T -1 

where A 2 = (u-j-y 2 ) l~ (p-j-m 2 ), Mahalanobis distance between and 


( 2 ) 

and 


x 

$(x) = / 


—09 


exp(-t 2 /2)dt/y2rT. 


When the parameters are unknown, they are estimated from the training 
samples from each population. In practice, the true values of the para- 
meters occurring in (1) are taken to be the value of their corresponding 
estimates obtained from the training samples of large size. These esti- 
mates are obtained on the assumption that the samples are independent . But 
in reality, especially when the data are obtained by remote sensing 

t 

technique, the samples are never independent. Often the assumption of 
independence may at best be approximately valid. So, it will be perhaps 
rational to assume the samples to be equi correlated, that is, all pairs 
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of these samples to have the same correlation. In this paper v/e investi- 
gate the effect of such intraclass correlation on the misclassifi cation 
probabilities of linear discrimination function. 

2. Basic Concepts 

The random vectors X^,...,X n are said to be equi correlated (Basu, 

Odell and Lewis [2]) if 

(1) D(X 1 -) = E[X.j-EX.j ) (X^-EX^ )^] = £, a symmetric matrix, for all i, 
and (2) Cov(X- ,X.) = E[(X.-EX. )(X--EX .) T ] = R, a symmetric matrix, for 
all i f j. If Xi,...,X are equi correlated random vectors, then the 
dispersion matrix V of their joint distribution is given by 

= I n ® (l- r) + E n 0 R (3) 

where A ® B denotes the Kronecker product of the matrices A and B, I n 

is the n x n identity matrix and E„ the n x n matrix all of whose elements 

n 

are 1. 

The random vectors X-j,...,X are said to be simply equi correlated if 
(1) D(X^) = £, a symmetric matrix, for all i, 

and (2) Cov(X. ,X.) = p£, p being a scalar constant for all i f j. 

* w 

If X-j,...,X are simply equi correlated, then the dispersion matrix V of their 
joint distribution is given by 

V = [(l-p)I n + E n ] © l . (4) 

Obviously V in (4) has been obtained from (3) by substituting p£ for R. 


V = 


l R •• 

R I 

• • 

• • 

Ir r .. yj 


Let us define the np * 1 random vector X as 



and suppose that X has a multivariate normal distribution with mean and 
variance given by 

EX = e n ® y and D(X) =' V = I n ® (£- r) + E n © R 

where e n is the n * 1 vector, all of whose components are 1. Also, let 
B be the n * n orthogonal matrix given by 



JL 


J. 

v'n 

ft 

ft 

ft 

J_ 

JL 

0 

0 

S2 

ft • 




JL 

JL 

0 

/6 

ft 

ft 


_1 

0 

0 


| 1 < 1 1 -(n-1) 

]_/n(n-l) /n(n- 1 ) /n(n-l ) /n(n-l ) 

Then the p x 1 random vectors Z^Zg*... »Z n> the n components of the np 
random vector Z given by 

Z = (B ® I p )X 

are independent, since 
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DZ = (B © I p ) DX (B Q I p ) 

(B ® !„)[!„ © (I-R) + E n ® R](B T @ I p ) 


p' *-"n 
i+(n-l)R <t> 


<P 


I-R 


<p 


I-RJ 


(9) 


Also, EZ = (B © I )EX = 


<p 

l i J 


00 ) 


Thus, for all i ( 2j< i <n ) Z^ ^ N p (<j>,£-R), that is, if X-|,...,X n are equi- 

correlated samples from a N (v,J) population such that their joint disti 

P 

bution is given by (3), then Z 2 ,...,Z n are independent samples from 

V 

by 


NpU^-r) population. The maximum likelihood estimator of £-R is given 


I Z.Z/(n-l) 

1=2 1 1 

Since B $ I p is an orthogonal transformation, it is well known (Ander- 
son [ 1 ] p. 52, Lemma 3.3.1) that 


01) 


‘ 1=1 




I z,z 

i=l 1 1 


02 ) 


Also from (7) and (8) it is evident that Z^ = Sn X. Therefore 


n 


l ZjZj = l X x! - n 
i=2 1 1 i=l 1 1 


T - XX T 


= l (X,-X)(X r X) T . 
i=l 1 1 


03 ) 
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Thus when the samples are equi correlated 

l (X r X)(X.-X) T /(n-l) (14) 

i = l 1 1 

is an unbiased maximum likelihood estimator of £-R, but not of 

3. Effect of Intraclass Correlation Among Training Samples 

Let X-|,...,X n and Y^,...,Y be training samples from the popula- 
tions Np(p-j ,^) and Np(u 2> Z) respectively. In practice, when Bayes 
classification regions (1) are defined on the basis of the training 

samples, y^,y 2 and Z being taken respectively as X, Y and 
n T n -r 

[7 ( X- -X) ( X . - X) + l ( Y . -Y) (Y - - Y ) ]/2(n-l). When the training samples 

i=l 11 i=l 1 1 

are really independent, for large values of n, the misclassification 

probabilities of Bayes procedure are given by (2). When the training 

samples are equi correlated such that the dispersion matrices V and V 

x y 

of their respective joint distribution is given by 

v x *' I,j x (J-R) + E n x R 

and = I N x (E- R) + E n x R, 

for large n X and Y still gives estimates of y-j and y 2 ’» 


S = CZ” (X,-X)(X.-X) T + l (Y.-Y) (Y.-Y) T ]/2(n-l ) 
i=l 11 i=l 1 1 


(15) 


fails to provide a good estimate of £, it then provides a good estimate of 
( l-r ) instead. 

If the training samples are equicorrelated and inadvertently S is used 
in place of £ in (1), then for large n the regions R-j and R 2 become the 
regions 
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/ 


R 2 C =R 1 = {x: (v } -v 2 ) J (I-R)' 1 [x-l/2( Ml +u 2 )] > 0 . (16) 

If we write 

W(X) = (u-j-y^^-R) ^[X-l/2(y^+v 2 )3 » (17) 

then the new misclassi fication probabilities are given by 

P(2|l) = P(W(X) < 0| I(X) e ir-j) (18) 

and P(1 12) = P(W(X) >. Ojl(X) e tt 2 ) . (19) 

Now, W(X) is distributed normally with mean given by 


EW(X) = l/2(ii ] -y 2 ) T (I-R)“ 1 (u r u 2 ) = A 2 / 2 if I(X) e ^ 

and EW(X) = - 1/2( w 2 ) T (X-R)" 1 (u 1 ~y 2 ) = -A 2 /2 if I(X) e 

and variance under either hypothesis given by 

Var U(X) = (u 1 -m 2 ) T (I-R)" 1 I(Z-R)' 1 (m 1 -u 2 ). (20) 

Obviously, 

P(2|l) = P(l|2) = $(-A 2 /2/varW ). (21) 

Case 1 . R - p£ , that is, training samples are simply equi correlated . 

A 2 = (1/2) (u 1 -u 2 ) T l ^ (u-|-y 2 )/(l-P) = a 2 /2(1-p) 

Var W = (' Ul -n 2 ) T r 1 ( Ml -u 2 )/(l-P) 2 = 4 2 /(1 -p) 2 

So, 4 2 / 2 /arH = 4/2. 

Therefore, P ( 1 | 2) = P ( 2 | 1 ) = $(-A/2) . 

Thus when the training samples are simply equi correlated and yet the Bayes 
regions are constructed on the inadvertent assumption of independence, the 
misclassi fication probabilities do not change. 
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Case 2 . Training samples are equi correlated, R / p> 

We consider a numerical example to illustrate how the misclassi fication 
probabilities are changed when the training samples are equi correlated and 
yet Bayes regions are defined with the inadvertent assumption of indepen- 
dence. 

Example . Let tt^ and -n 2 be two 3 dimensional normal population 
N 3 (v.,D where 

P 1 = 



Also let the training samples be equi correlated, such that for both 
population 

0.2 0 0 

R = 0 2 0 

0 0 0 . 

Then a 2 = (u}"y 2 )^ \ 1 (y^-y 2 ) = 7/3, 

A/2 = 0.7638 

A 2 = (y^.yg)^ (Z-R) ^ {u}-v 2 ) = 13 -75 
VarW = (y r y 2 ) T (Z-R) -1 KI-R)" 1 (y r u 2 ) - 81.25 
A 2 /2^arW = 0.7627 



Therefore, the misclassification probabilities for 

actual Bayes procedure: <{>(-a/ 2) = <f>(-0 - 7638) 

uncorrected Bayes procedure: $(-A 2 /2i4arW) = <{>(-0.7627). 

Thus the misclassification probabilities increases when training samples are 

equi correlated .and yet Bayes regions are defined with the inadvertent 

/ 

assumption of independence. 
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ESTIMATION OF PROPORTION OF OBJECTS AND DETERMINATION 
OF TRAINING SAMPLE-SIZE IN A REMOTE SENSING APPLICATION 

I 

1 . I nt roduct ion 

The multichannel spectral measuring devices that are used as remote 
sensors fail to observe any vegetation, flora, etc. grown underneath 
timber on the earth surface. Suppose the latter is observable via a 
spectral measuring device and is, however, identifiable with certain 
uncertaini ty. If we know how the amount of vegetation/flora is asso- 
ciated with different types of timber, an evaluation of the former over 
a large track of land covered by forest, etc. can be made easily by the 
remote sensing technique. 

However, as in most cases, the true parametric values such as the 
probability of correct identification, amount of vegetation/flora corre- 
sponding to various types of timber are unknown quantities. Hence a study 
of the problem first requires estimates of the unknown parameters on the 
basis of samples of both timber and vegetation from the ground. In the 
present report thTs estimation problem is being considered in its general 
form and our approach constitutes a two-stage sampling process where at 
the first stage samples consist of individuals called primary units, and 
at the second stage samples consist of categorized elements called sub- 
units (e.g., timber and vegetation types at first and second stages, 
respectively, in the above example). A formal formulation of the problem 


is stated as fol lows : 
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Let II. , i=l,2,...,m be m different classes and every individual from 

these classes be characterized by p common observable features so that a 

measurement vector X=(X, ,X_, . . . ,X is associated with an individual I ( X ) 

l 2 p 

from each class. Next, associated with these individuals let there be 
another kind of elements categorized into k groups with proportions Pjj» 
j=l,2,...,k for each i. Further we assume that at least one element (sub- 
unit) is associated with each individual (primary unit) from every class. 

On the basis of an observation X, the associated primary unit l(X) may be 
mi sc lass i f i ed, and let P(i|k) denote the probability of mi scl ass i fy ing l(X) 
into u. when it belongs to and P(i|i) denote the probability of correctly 
classifying I(X) into its class tt . . Then, given an observation X, the 
expected proportion of jth category subunits associated with primary unit 
l(X) from Hj is given by 


m 

e. . = y p^ . 
' t=l tJ 


p(tji) 


( 1 ) 


»=1,2, — ,m and j = 1 , 2 , — ,k . 
k 

Note that for any fixed i, £ e. . = 1 and e. . = p.., j=l,2,...,k, if 

-z J IJ IJ I J 

and only if P(iji) = 1. But the later condition is an ideal one and often 

is not achievable. However, an effort should be made to separate out the 

underlying classes maximum possible, and thereby to obtain maximum possible 

values for P(i|i), i = l ,2, . . . ,m, so that p . j * s can be ascertained with minimum 

possible error. Otherwise, the evaluation of p. . provided by e. . can be 

*J U 

very misleading. 


I 
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For an estimate of e.., one needs to obtain estimates for P(tji), P f j » 
t, i=l ,2, . . . ,m and j=l,2,...,k. For that, samples of primary units and of 
subunits are required from each class. Below in section 2 we outline a 
sampling procedure and introduce some of the notations being used later 
on. Our main results are obtained in section 3 and section b where we 
will discuss the interval estimation of e . j * s and the determination of 
training sample size so that for a given probability an estimate allows 
only a specified amount of deviation about each e.j. 


2. Notations and Sampling Procedure 
Without loss of generality, let there be two classes tTj and 
Further, suppose the measurement vector X is distributed multivariate normal 
with mean y if i(X) e tt j and mean v if I (X) e rr^ and has variance-covariance 
matrix J for both classes. Then it follows by maximum likelihood principle 
[1] that P( 1 1 2) = P (2 1 1 ) = 1>(-A//2), where 


A 2 = (y-v) T l 1 (y-v) 


and 


a 2 

$(a) = (l//2^)/ exp (-y / 2) dy . 


In case, y, v and £ are known, P(l|2) and P (2 [ 1 ) will be known. So in 
order to estimate e^ and e 2 j * one ° n, y needs to estimate Pjj ancf P 2 j » j — 1 » 
2,...,k. This will be achieved by sampling Nj primary units randomly from 
ifj and N^ from tt^ and then determining separately the observed proportions 
of k categories of subunits associated with these Nj and sampled primary 
units. 
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Wben jj, v and £ are partially or completely unknown, A will be 
unknown and so. also P ( 1 j 2) and P (2 ( 1 ) . Then for estimating any e„ it will 
require two estimates, one for p.j and the other for P ( 1 [ 2 ) and P(2|l). 

The sampling procedure in that case will be to select randomly Mj and 
primary units from ttj and -n^, respectively. The observations for these 
selected units will be utilized to estimate A and thereby P ( 1 J 2 ) and 
P ( 2 j 1 ) . Next, Nj out of Mj and N^ out of primary units are again ran- 
domly selected and these Nj and N^ units are used similarly to the previous 
case in finding estimates for p.j, i = 1 , 2 and j = l,2,...,k. 


3. interval Estimation for e. . *s 
U 

yi, v and J all known 

Let n jj s denote the number of the jth category subunits associated 

with sth primary unit randomly selected from tt.. Also, denote 
* 

N. u 

i k 

n 5 : “ l and n = £ n J-1,2 k , 

• J s= , »js i j = , U 

A 

for N. randomly selected primary units from i=l,2. Then p. . = n../n. 

• i 'll l j i 


is an unbiased estimate of p.., and so also 

•J 


2 . 


n * I Pm 


U 


( 2 ) 


t=l 


°f e jj* ' = ar) d j = • » 2 , . . • , k . This can be easily seen because the sample- 
size n. of subunits being a direct consequence of the sampled primary units 
N. can be recognized fixed for a specified value of N. and P ( t ( i ) * s are 

A 

known quantities. So E[e.j] = e„ and 
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Var (e..) = [P(t|i)] 2 P t j ( 1_p tj^ /n t 


Cov (e. j .e. j , )= - [PCti')] 2 P tj P t ji /n t » 


tj r tj 


i = l ,2 and j = l ,2, . . .k. 


By the large sample theory, asymptotically the random vector 

A A A A J 

e. = (e.,, e._, .... e.,) has multivariate normal distribution with 
i 1 1 1 2 ik 

mean e. = (e.j, e j 2 » •••» e j^) and covariance matrix E consisting of 
elements in (3a) and (3b). So a 100(l-a)% confidence region for e. is 
approximately given by the ellipsoid of points e.'s satisfying 


( e i j~ e i j) T E 1 ^ 6 i j _e i j ^ - X a 


where E is an estimate of E obtained by replacing p..'s by their estimates 

^ l 

p..'s and x 2 (P"0 is the 1 00 ( 1 -ex) % quantile for x 2 variate with (p-1) 

i j oi 

degrees of freedom. Considering the coordinate-wise projection, this 
yields the simultaneous confidence intervals 


e ij ± ^ X a ( p_l ) J (P (t J » ) ] 2 P t j 0-p t j)/n t ] 


for e.j, j=l,2,...,k and i-l,2= 
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I 


3.2. Not all of yi, v and J known ^ 

Here we need to estimate both p.j's anc * P(t|i)*s in (1) for an 
estimate of e . y ?=1 ,2 and j=l,2,...,k. Since 


P(t|i) = 


l-4>(-A/2) , t=i 


$(-A/2) , 


( 6 ) 


with i=l,2, the estimation of any P(tji) amounts to estimating the quantity 

A rfs 

4>(~A/2) . For the later an estimate is given by 4 >(-a/ 2) where A is the 
maximum likelihood estimate of A based upon samples observations 
X^ of Mj randomly selected primary units from rc j and observations 

A 

of M 2 randomly selected primary units from ir Since $>(-A/2) is 
a consistent estimate of $(-A/2) , due to (6) it leads to consistent esti- 

A 

mates of P(t[i), i = l,2 and t=l,2, denoted by P(tji). Next, as in the previous 

A 

case, for an estimate of p.., let p. . be the estimate obtained from subunits 

»J k ij 

associated with Nj and N 2 randomly selected primary units from Mj and M 2 
respectively. Then the estimates 


2 „ 

e iJ " l P t : P (t|«) 

U t=1 tj 

for e . j , i-1,2 and j=l ,2, . . . ,k, are consistent. 

For our purpose of finding an asymptotic simultaneous confidence 
intervals for e.^'s, it is suffice to find the mean square errors (MSE) 

A , A 

for these estimates. After considering the two estimates p^yand P ( t [ i ) 
stochastically independent so that 


( 7 ) 
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Var (p.j P(t | i)) = [E(J tj )]2[Var(P(t|i)] + [E(P(t | I ) ]2Var (p ) 

+ Var (p t j) [Var (P(t [ i))] , 

A 

one can easily deduce the MSE of e„, i = l,2 and j=l,2,...,k. 

Since Var (P ( 1 i i ) ) = .Var (P(2 { i ) ) = Var (C>(-A/2)), i = l,2, we obtain 

Varfe^) = Var ( p j j P ( 1 1 1 ) ) +Va r (P 2J P (2 j 1 ) ) +2 Cov(p ,^(1 1 1 ) ,p 2j P(2 1 1 ) 


3P, .(1-P, .) 3p 9 ; (1-p, .) 

U -' J + 2J +p fj +p 2j" 2p lj p 2j 

n l n 2 


Var ($(- f)) 


A Pi 5 0 “Pi ; ) A Po i ^ ^ ~P? t ^ 

+ [l-E(i(-.|))] 2 -lJ ^+[E(S(- |))] 2 . 

n l n 2 


( 8 ) 


Since it is difficult to evaluate E[0(- ] and Var($(- -~) ) , we here consider 

A 

the mean square error of e.j given by 


MSE(e. .)' = 

»y 


n l n 2 


MSE(*(- §•)) 


A * P 1T 0“P,;) * ; 0“P o ;) 

0 -•(- f )) 2 - ) ) 2 -^ — ^ . 


(8a) 


'i 


n„ 



Simi lar ly, 


MS£(e 2j ) = 


t 3P 2j ( ' ~ P 2j > + < p 


(p lj" P 2j )2 


MSE($(- |)3 


A Pi;0“Pi*) a P^;(^~Po;) 

( 4 (. |))2_L! Lu + (!-,(- |))2 _2j IL . 

n l n 2 


We denote MSE(e..) = s.. and let its estimate s.. be obtained by replacing 
i j i j ij 

unknown quantities by their estimates. 

Now as in the previous case, an approximate 100(l-ct)% simultaneous 
confidence intervals for e jj» i = 1 » 2 and j = l,2,...,k are given by 


e.. ± [s. .x 2 (p-l)] 
i j tj'Nx r 


1/2 


1=1 ,2 and j=l,2 k, respectively. 

The two particular cases of interest are (i) y,v are unknown and £ 
is known and (ii) .y,v and £ are all unknown. For (i), the maximum likeli 
hood estimate of A 2 is given by 


A 2 = (X-Y) T jf 1 (X-Y) 
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and for (ii), this estimate is 

a 

. . A 2 = (X-Y) T S" 1 (X-Y) 

N 1 N 2 

where X = 7 X./N, , Y = Y Y./N_ and 

j il y i 2 


01 ) 


1 _ _ T 2 

(Nj+N 2 -2)S = I ( X . — X) ( X - - X) T + l (y.-y)(y.-y) t . 

k. Sample S i ze 

Presently our concern is to determine the sample size so that only a 
specified amount of error for e.j's is allowed by their estimates with a 
given probability. In specific terms the problem is to find (nj.n^) and 
consequently' (N,N 2 ) so that e. . fall simultaneously in intervals given by 
e.j + r.j, i = l,2 and j=\,2,...,k, with probability (1-a) . However, this 
is equiyalent to obtaining (nj ,n ) when the length of a confidence interval 
for e.j with confidence level 100 (l-a)% is given. Hence, based upon the 
discussion in section 3. an asymptotic solution for the sample size is 
available from equations (5) and (9) for the two cases considered above. 

Suppose y , v and \ are known. Then using equation (5), an asymptotic 
sample size (nj.n^) is the solution of 

2 . 

I [P (t 1 i ) ] 2 P ti (l-P ti )/n = y?./x 2 (p-l) , (1.2) 

*==1,2 for any j. After simplifying (12), we obtain 

", - C P- j 0-p ,j)/( [ y , jP(2|2) ] 2 - (y 2j .P(I|2)) 2 ) 


( 13 ) 
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and 


n 2 = c P 2 j 0' p 2 j)/( t Y 2 j p ^ I ' CYj j p (1 ! 2) ] 2 ) 


(14) 


where 


C - ([P(l|l> P<2[2)]2 - [P(l J 2) P(2|l)l 2 ) X 2 (p-I) 


Note that for each j one finds (n^n^) from (13) and (14). So by taking 
the maximum of these solutions for each of n^ and n^ separately a determina- 
tion of sample size is obtained. However, by a judicious choice of Yjj» 
1 = 1,2 and j=l,2,...,k, (13) and (14) may yield the same value for all solu- 
tions of nj and so also for n£- Then any such common solution (n^n^) will 
be the desired sample size. 

When at least one of y,v and £ is unknown, a similar asymptotic deter- 
mination of sample size is obtained from a solution of 


s ij = 


05) 


1 = 1,2 for any j. Denoting MSE (<J>(- j)) 
follows from (8a), (8b) and (15) that. 


s_ and its estimate by s . it 
0 0 


A A A A 

* A n * A . Pi I ^ ”Po i ) 

[3s.+( 1-«(4)) 2 ) + l3s„+(«(- #)) Z J -ti 


(p U' P 2j )Zs 0 + 


lj 


X 2 (P-1) 



- 121 - 


and 


• -A P,;0-P,;) - A Po ? 0 "Po ? ) 

[3s 0+ (<,<- f»*l (3 s 0+ O-<.(- f)) 2 3 lj - 




Then for a determination of (n^n^), we have 


(a 2 -b 2 ) j Q-P)}) X^(p-0 

1 (p lj" P 2j )2 f }) *0 X a (f>_1) + aY lj " b7 2j 

and 

(a 2 -b 2 ) p 2j 0-p 2 j) X^(P-Q 

2 (Plj~ P 2j) 2 S 0 x a^ p_1 ^ + aY 2j “ bY ?j 

where 

. a = 3 s q + ( 1 -4>(-A | 2) ) 2 


( 16 ) 


07 ) 


b = 3s Q + 0(-A|2)) . 

Once again, nj and n^ are determined by taking the maximum value among 
such solutions of n^ and n^ respectively for j=l,2,...,k or by having a 
common solution derived from a judicious choice of y.., I®=1,2 and j=l,2,...,k. 


I 
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5. Univariate Case 

In order to provide a specific and also somewhat interesting result, 
we specialize the problem to the case where the random measurement X is 
univariate having normal distribution with mean if l(X) e itj and y 2 if 
l(X) c tt ^ and with variance 0 2 in both cases. Instead of considering maximum 

A 

likelihood estimates e..'s given in section 3> we want to find an unbiased 

A A 

estimate for e.., 1=1,2 and j = 1 , 2 k. Since any p.j and P(i|k) are 

stochastically independent and p.j = n.j/n. is the minimum variance unbiased 
estimate (MVUE) of p.j, a similar estimate of P(ijj) in (7) leads to the 

MVUE of e. . , i = l,2 and j=l ,2 k. 

In order to find the MVUE of any P(t[i), it is suffice to find the 
similar estimate of $(- — ) where a = iy]"P 2 K 0 ' Without loss of generality 
let Pj > y 2 - It follows by theorem 1 in Ellison ( 1 964) that (2U-1) /v S 
has normal distribution with mean 0 and variance a 2 where U independent of 


S 2 “ I (Xj“ X) 2 + ^ (Y . -Y) 2/ (Mj+M^- 2) is distributed as 3(— , beta distribut 


v-Mj+M 2 -2. Then the random variable 


i I( 2U-l)S«^iT + (x-Y)] 


M, M,' 


has the normal distribution with mean (p j — y 2 )/ 2 and variance a 2 . Accordingly, 


we observe that 


$(-J)=Prob { (2u-i ) s A[k-(L- + i_)j + (jT-y) < o) 


=E [Prob{ (2U- 1 )s A> [^- — -) 3 +(x-y) <_ 0jx,y,s}] 


-123- 


where E stands for expectation with respect to variates X, Y and S. 
Now from (18), the MVUE of $(- is 


Prob { (2U- 1 ) s + -i-) ] + (x-y) _< 0|x,y,s} . 


'1 “2 


Thus the MVUE of $(- |) is 


Prob (U < j - (x-y)/2s + p— ) ] j x,y,s} 


(19) 


where U has ) distribution. S i nee- extens ive incomplete-beta 

integral tables are available, (19) can be easily evaluated for any given 


values of x,y and s obtained from sample observations on Mj and M 2 individuals 
from it j and -n 2 respectively. 

Denote 


u = 


1 


x-y 


2s Vv(k- + tt-)) 

M l H 2 


Then 


p(t|i) - 


f\ - i(- |) . t = i 


4> (~ j) , t f i 


( 20 ) 


where 


4> (- f) 


1 w 

j— / r /, \ i (v“3)/2 

[u\l-u)j' 

2 2 ) 0 


ou 


( 21 ) 


u 
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Since the use of (20) leads to the MVUE of e.^ given by 


2 . . 

e. ; = l p ( t i') » 


it is convenient to find var(e.j). This easily follows from (8) if 


A ^ A yy 

var( 4 >(- -r-) is evaluated. For that we only need to evaluate E[(o(-y)) 2 ] because 


we already know E [<!>(- -z ) ) = <5>(- y) . Observing that (X~T) /v/S is a non- 


central t variable with v degrees of freedom, denoted by t(A,v), we have 


E[(<S>(- j)) 2 ] = E [(Prob {U £ - (x-y)/2s A> [4- (~p + —-) ] | x,y,s}) 2 ] 


= E [(Prob (2U-1 _< -t(A,v)/v /[4- (^- + yp) ] |t}) 2 ] 


- E[Prob (max(2U | -l, 2U.pl) < -t(A,v)/v /[4- (^- + — )] |t)] (22) 


where Uj and U 2 are two independent random variables, each distributed as 


B(~ 2 ~,“ 2 “) . If we let W = max (2U,-1, 2U 9 -1), the density function of W is 


f (oi) = 


2 v -3 B (Vli,vii) 


1 _ nwi (v " 3)/2 n-ni (v " 3)/2 i /v-i v-k 

Ll vlK U+u) 0 u) ( l+u>) /2 ' — ?• -1<0)<1 


where I^(a,b) stands for the i ncomp 1 ete-beta integral, and 
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u 


F(u>)=Prob {W <_ cj}= I f (w) dw 

-1 


2 0+w)/2 v-3. v-3 , i 

v-hnr ' ^ 




.0 


(l+u)/2 - , “ Bt^jkn+l) . 

/ y 2 0-y) 2 m l y^- TTr y 'h> 


n £ 0 B (v- 1 , n+1 ) 




> (v-1)B2(Y-.V ) 




V-I V-lV |»(v-l,v-n I |+u (v-l.V-l) 




+ L B{v-l','n+l) B(n+V - V ' ,) I i+ U) (n+v.v-1)] 

n u £ ^ 


Then from (22) 

£[(«(- -I)) 2 ) - / F(-t A/t‘,-(i- + i-)l) 


M 1 »2 


g(t;A,v) dt 


(23) 


where g(t;A,v) is a non-centra 1 density function involving the non-centrality 
parameter A and v degrees of freedom. As it is somewhat difficult to evaluate 
the right side exactly, it can be easily computed numerically for a given 
value of A and v. 


At present we are seeking an estimate of var(e.j) so as to be able to 


evaluate the standard errors s.^, i=l,2 and j=l,2,...,k. For that purpose 


E[(*(- f)) 2 ) can be estimated by evaluating the right side of (23) numerically 
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after considering A and v given by A and (Mj+H^“ 2) respectively; and 

similarly 4>(-‘ j) can be used replacing E(«(- -^j ) • Further, replacing 

p.. by its estimate p.., one thus obtains the s.e. s.., i = l,2 and 
ij » j » j 

j*=l,2,.,.k, and then from (16) and (17) the sample size (nj.n^) is 
determined after replacing Sq by the estimate of Var ($>(" ^) ) • 


6 . Remote Sensing Application ", 

The previous discussion on estimation and determination of sample 
size though treated in general has. primarily been motivated by our need 
of finding desirable estimates for the expected proportion of various 
types of vegetation/flora over a certain region covered by different 
types of timber using remote sensing techniques. But the analogy seems 
to exist in many other cases dealing with mul t ispectral sensor data 
because it is not unlikely for different types of objects to be within 

v 

the instantaneous field of views of a mul t i spectral scanning device. Hence, 
the above discussion can be applied to ascertain the contribution of each 
type of objects making up a resolution element that gives rise to an obser- 
vation obtained by a remote sensor. The analogy may briefly be outlined 
as fol lows: 

Without loss of generality let there be two classes of resolution 
elements. The measurements on these resolution elements in each class are 
supposed to be normally distributed and on the basis of a measurement the 
resolution element may be mi sclass i f ied . Further, let there be k different 
categories of objects that might be associated with the resolution elements 
in each class. With this set-up one can now obtain estimates of the expected 
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proportions of the specified categories of objects from (2) or (7) 
depending upon the knowledge about the underlying normal distributions. 
Furthermore, by considering the number of objects selected according to 
(13) and (1*0 or (16) and (17) as the case may be, one can actually ob- 
tain the desirable estimates which approximately allows a specified 
amount of error about the true expected proportions for the object types 
wi th a des i red amount of probabi 1 i ty. 
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ABSTRACT 

A computing technique for adjusting remote sensing 
spectral data for environmental effects is formulated. The 
technique is essentially invariant with respect to the 
atmospheric model used in the paper; hence, it can be replaced 
with a better model. 
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ON ADJUSTING REMOTE SENSING 
DATA USING A RADIATION 
TRANSFER MODEL 


I . Introduction 

In a recent report [1] a radiation-transfer model was 
developed to predict the apparent radiance, L, of a ground 
target, as it is observed by a multispectral scanner 

L = j- ET + L , (1) 

ir p 

where p = the diffuse reflectance of the target material 
E = the irradiance at the target 

T = the atmospheric transmittance from the target to 
the multispectral scanner, 

= the path radiance. 

P ' 

The following parameters are used to describe the conditions 
of observation: - 

1. t ( h) = optical depth of atmosphere at altitude h of the 
sensor. 

2. y = cosine of zenith or nadir angle. 

3. ($ - <J)q) = angle between scan direction and sun's azimuth. 

4. 0 = nadir scan angle. 

5. 0q = zenith angle of sun. 

6. V = visibility (visual range) of the atmosphere, or some 
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better estimate of the distribution of haze and scattering 
particles present. 

7. p = averaqe diffuse reflectance of the scene that 
K avg 

contributes to the path radiance scattered into the sensor's 
field of view at wavelength X. 

8. n = anisotropy parameter. 

9. Tq = total optical depth. 

10. Eq = Solar irradiance on a surface where normal lies in the 
direction given by y^ and <J>q. 

In order to show the functional dependencies involved in 
equation (1) , it can be rewritten as 


L[p/0/(4> < ^o) / "^(h) , ®o , V # P aV g ' ^ ^ 


m 7 T - x - ) - E[T<h) /V v,p avg' x] t [ e / t (h) ,V,X] 


bp [ 0 , ( $ - 4*q) / *r(h) / OqjV / P aV g / ^ 1 • 


II. Main Content 

2 . 1 When Observations are known to be generated by same target . 

We want to consider the radiation-transfer model 
statistically as a covariance analysis model with covariates 
given by 1 through 10 and obtain the best estimate for E, the 
irradiance at the target, once the observations L have been 
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ad j us ted for the atmospheric conditions. For ease of presenta- 
tion, the covariates will be denoted by B^,...,B^q. 

The model in equation (1) can be written as 


i 1/2,. ..,N (2) 

where the e^ are non-observable, uncorrelated random errors 

2 

each with mean zero and variance a . 

The problem is to obtain estimates of which 

minimize 

Q = [L - GJ T [L - G] 


T T 

where L = (L^,...,L N ) and G = [g^,...,g N l • 

Direct search techniques [2] have been used with success. 

One simply searches judiciously for the value of 3 = (3^/...,3 ^q) 
= B ls which minimizes Q, by approximating 3 LS with an a priori 
value, say B^, and then iteratively computing 3^. + ^ so that Q 
approaches a minimum. 

Hartley [3] suggests a method in which he solves for B LS 
by solving the non-linear system of equations 3Q/3B^ = 0 
which is a necessary condition for Q to be minimal. By selecting 
3^ a priori and letting G be approximately 


G(L ;S ) s G(L;6 k ) +S « k+1 - B k ) 

B=e k 


one can solve for B^. + ^, and on iterating the sequence (3^) 
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converges in many cases to 3 TC * 

Lio 

Walling [4] and Nelson [5] have developed techniques 
which take advantage of the linearity in the non-linear 
function G. Instead of approximating G* in a truncated Taylor's 
expansion, they approximate G by < 


G (L; 3 ) . = 

G(L;B k ) + 

3G (L ; 3 ) - 

3G I *"■ 

33 

33 |3=B k 

6 (3) = 

CD 

TX> 

X 

0(3) = (0 1 (3) 

/ 0 2 ( 3 ) ,.. 


3G 

33 


3, (0 k+l “ P k* 

k 


parameter vector and where G* = A0 (3^) + C(L;3) , and A is a 
known matrix of constants. 

Then an a priori estimate B k leads to 3 k+ ^ which when 
iterated gives a sequence (3 k ) which converges in many cases 
more rapidly to 3 rc . 

AjO 

Comparison of Hartley's technique, Walling's technique, 
and Nelson's technique can be found in [5] and [6], 


2. 2 When observations are not knov. T n to be generated by the 
same target . 

In the previous section a technique for estimating 

/* 

3^, i = 1,2,..., 10 and E^ was briefly described. If E is 
the estimate for E the irradiance of the target, then one can 
use this value to perform discriminate analysis. However, this 
is indeed a special case and is not a realistic analysis for 
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the remote sensing application. In most cases one is almost 

t h t h 

never sure if the x and j observations, and X ^ , are from 
the same class until after the discriminate task has been 
performed. The symbol E must be replaced with the in the 
model described in (2) , that is 


= g. 


(Bl/B 2 ' 


' 6 10 ;E i > 


+ e , 





(3) 


Note that in (3) , there are N equations and N- + 10 unknowns , 

3^ , . . . , ,E^ , . . . ,E n , an undetermined system. The estimates 

for E. ,E„ , . . . ,E„ are the values we seek to base our discriminate 
12 N 

task upon, since the values of these estimates would be void of 
any modeled environmental effects. That is 

A A A A 

E ^ — h^ * ’ ’ '^10' E i ^ ( 4 ) 

^ A A A A 

One can solve for E^ if values of 0 ^ • $ 2 ' * * * ' ^10 an ^ L i 
are available. Hopefully, this is the case in the remote 
sensing application. 

However, in the case in which some values from the set 

A A A 

(3 1 , &2 ' * * * • ^10 ^ are un bnown, onG can iteratively estimate E^'s. 
This can be done by the following scheme: 


(a) 

(b) 


Discriminate using the non-ad jus ted measurement 

as E . . 

1 

Collect those resolution cells I(x) such that x e R., 

J 



-1 36- 


then Use these elements to estimate 3^ ' * ' * ' ^10 

and E = E ( ir . ) as in section 2.1. 

3 

Since if x c R., one assumes that E. = E(tt.) = E, the radiance 
3 13 

of an individual from The symbol N.. denotes the number of 

elements I(x) assigned to tt^. 

(c) for each j there exists a set 


$1 ( j ) / ••• 3 — 1/2/.../ 


m. 


These are combined to get a better estimate 


~ m 

B< =• l a.B(j) 
3=1 3 


such that 


m 

a. = N./ 7 N. 

J 3 i 


(d) Using 3^/ i = 1,2/..., 10, compute E^ in the 
equation 


A A 


> L 




and use E^ to perform the discrimination. 
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III. Concluding Remarks 

It is important to note that our purpose here is to 
determine any problem areas in adjusting data for environmental 
factors and formulate a computation scheme to perform the 
adjustment and not select, evaluate, formulate, or modify an 
environmental or atmospheric model. Those whose expertise 
covers the topic of modeling an atmosphere should select the 
"best" model. The computation procedure suggested here is 
essentially model invariant. 
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ON ESTIMATING THE PROBABILITY OF MISCLASSIFICATION 1 

by 

B. S. Duran, H. L.. Gray, J. Tubbs, and T. L. Bouillon 

Texas Tech University 


1 . Introduction 

The problem of discrimination has long been a problem of in- 
tense interest (see for example [2], [9]). Given m populations 
■n 2 ' • • • ,1T n\' the P r oblem is that of classifying a p*l vector observe 
or an individual I (x) corresponding to x, as belonging to one of 
the m populations. Anderson ([2], chapter 6) discusses the classi- 
fication problem when the populations ?, 1 ,n 2 , . . . , 11 ^, are multi- 
variate normal with equal covariance matrices. In the classification 
problem when m = 2 and £ 1 = £ 0 = E , the discriminant function is 
given by 


(1) U = x^*£ 1 (y (1) - m (2) ) 
(11 ' 2 ) 

However, if p ,y ' , and E 

criminant function 'to use is 


1 , (1) , (2K -1, (1) (2), 

2 (y • + y )£ (p - v ) . 

are unknown then a reasonable dis- 


(2) V = x T S - x (2) ) - | (x (1) + x (2) )S 1 (x (1) 


x (2) ) 


where x^ = £ xf-.^/ni/ x^ = y xpVn 0 , and 

j=l 3 


j=l 


n. 


(n. + n 2 - 2 ) S = l (x. (1> - x (1) )(xP - x {1) ) 

- 1 - 2 nil 3 ' -3 


n. 


*1 (xf>- d 2) ) T . 

j«i 3 3 

rhis research was partially supported by NASA Manned Spacecraft 
Center under Contract NAS 9-12775. 
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The distribution of U is the normal distribution N(--a,a) if 
x is distributed N(y^,E) or N(-3]-a,a) if x is distributed N(y^,z), 
where a - (y ^ - y ^ ) £ ^(y^ -y^). The discriminant function 
(2) is a special case of a class of statistics considered by Wald [1C] 
Further work concerning the distribution of V has been done by 
Anderson [1], Sitgreaves [8], and Kabe [6]. 

Wald [10] actually considered a class of statistics of which 
T -1 —(1) —(2) 

the statistic x S (x - x ) is a special case. For large values 
of n^ and n 2 this appears to be a reasonable statistic for classi- 
fication purposes. The distribution of the above statistic is given 
by Wald in terms of three quantities, say, m^, m 2 , m^* and an ex- 
pected value which he does not evaluate. 

According to Sitgreaves [8] the statistic V may be written as 

T -1 t -1 

V = ay-jA y 2 + by^A y 2 

where a and b are known scalars, y^ and y 2 are p-dimensional normal 
variates with E (y^) = and E(y 2 ) = £ 2 , and A is a P X P symmetric 
matrix having a Wishart distribution with n =. n^+n 2 degrees of free- 
dom. Furthermore y^,y 2 , and A are independently distributed with 
common covariance matrix E . . 

Anderson [1] obtained the distribution of m^, m 2 , m^ explicitly, 
including the evaluation of the expected value in Wald's result, for 
the special case when is proportional to C 2 * 

Sitgreaves [8] gave the distribution of m^, m 2 , m^ when S^.is 
proportional to £ 2 and included the normalizing constant of the dis- 
tribution which was not obtained by either Wald or Anderson. 

Kabe [6] obtained a further extension by finding the distribution 
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of m, , nu, m_ without assuming the proportionality of an ^ £>>• 

12-5 x c 

However, his result is not in closed form and appears to be very 

awkward to work with. 

Of primary concern in the classification problem is the probability 
P(i|j), i ^ j , of misclassify ing an individual I (x) from population 
j in population i. The complexity of the distribution of V makes 
it virtually impossible to compute P(i|j). An available option is 
to evaluate an estimate, P(i|j), by the Monte Carlo technique. 

This is the topic that concerns us in this paper. 


2 . Description of the method 

Suppose x-j 1 ^, x^ / • • • / x^ 1 * and x^ 2 ^ , x^ , . . . , x^ 2 ^ denote 

1 2 

two independent samples from two normal populations and with 

mean vectors and v respectively, and common covariance 

matrix I . If n 1 = ng then the distribution of V if x is from 
is the same as the* distribution of -V if x is from (see [2] , 
p. 135). The statistic U also has this property. For large values 
of n^ and n 2 the probability of classifying an observation from tt 2 
in is approximately 

(3) / — e -(x+»/2) 2 /2« dx 

0 /2'iTo 

since x is asymptotically distributed N (-a/2, a) when it comes 
ti 2 * Similarly P(2|l) is approximated by 


from 
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(4) 



(x- a /2) 2 /2a dx 


Hence, if x is classified in ^ when V > 0 and in ^ otherwise, 

P (2 1 1) = P Cl | 2) . 

The probability P (2 | 1 ) can be estimated by Monte Carlo 
methods for small ( and moderately large) values of n^ = n 2 = n. 

The process involves first sampling n training samples from each 
of two multivariate normal populations which differ only in the 
means, i.e., y ^ ^ y ^ 2 . These samples are then used to compute 
S ■*", x^ , and x^ 2 ^ . A sample of size m is generated from 
N(y^,E) and m values of V are calculated. The probability P(2|l) 
can then be estimated by - 

P (2 1 1) = k/m 

where k is the number of values of V that are negative. This pro- 
cess is repeated r times and the average of these r values of 

a 

P ( 2 J 1) is used as a final estimate of P(2jl). It is of interest 
to examine this probability for various values of ^ , a, and 

p, keeping in mind that the large sample value of P(2|l) is given 
by (4) . Values of a are obtained by keeping l fixed and varying 
the values of y^ and y ^ (or y ^ - y ^ ) . 


3. Monte Carlo Results 

Estimates of P(2|l) were generated for various values of 
n = = XI 2 , a, and p, where a = (y ^ - y^)E ^(y^ - y^ 2 ^) 
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* t . . 

reflects the separation of the two populations. The covariance 
matrices for tt^ and tt ^ were = Z = I. (No loss of generality 

is incurred by using I for the common covariance matrix since 
there exists an orthogonal transformation followed by a linear 
non-singular transformation that yields I.) 

The data was generated by means of the normal random generator 
described in [7]. The estimates of P ( 2 [ 1 ) where simulated for 
various choices of two populations by fixing the covariance matrix 
and varying a. The values a considered were a = 1,2,3,4,5,10,12, 
20,25. The training sample sizes considered were n = 5,10,15,20, 
50,100. For each choice of a, n, and p, the values for m and r 
were 50 and 50 respectively. Each observation was classified into 
if V >_ 0 and into otherwise. . 

A 

Table 1 gives the estimate P(2fl) as a function of n and' a 
for various values of p. Ncte that for every a, P(2|l) = P ( 2 ] 1) 
when n = 50 or 100, where P(2[l) is the asymptotic probability of 
misclassif icatior. given by (4) . 

Figures I-IV reflect the fact that P(2jl) is a decreasing 

/ 

function of a (for fixed n and p) . Figures V-VII reflect the fact 
that for fixed a and p the probability P (2 1 1) is a decreasing 
function of n. Also for fixed n, indications are that P(2|l) is 
an increasing function of p. 

The value of a is of primary concern since in a given situation 
the values of p and n are generally known. In an actual situation 
such as in a remote sensing application [5] it may be desirable 
to know approximately how many training samples are necessary so 
that classification based on these training samples will incur an 
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error of misclassif ication not exceeding certain preassigned 
bounds. In a remote sensing application one could estimate a from 
the data and thus get an estimate of P ( 2 { 1 ) from Table 1. 

The quantity p is also of importance. Its significance in 
relation to P ( 2 1 1 ) has already been observed in Table 1 and Figures 
V-VII. In agricultural applications of remote sensing data analysis 
a popular value appears to be p = 4, [3], [4], 
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TABLE 1 


A 

Estimated Probabilities of Misclassif ication, P(2|l) / 
for Values of a, n and p. 


a' 

p 

n 

5 

10 

15 

20 

50 

100 

CO 

1.0 

3 

6 

9 

12 

0.4192 
0.3960 
. 0.4176 

0.3672 

0.3824 

0.4052 

0.4084 

0.3568 

0.3632 

0.3884 

0.3775 

0.3432 

0.3520 

0.3556 

0.3960 

0.3252 

0.3200 

0.3350 

0.3524 

0.3110 

0.3208 

0.3150 

0.3290 

.308 

2.0 

3 

6 

0 

12 

0 . 3292 
0.3416 
0.3324 

0.2768 

0.3224 

0.3304 

0.3584 

0 .2724 
0.2940 
0 .3176 
0.3224 

0.2600 

0.2796 

0.3000 

0.3392 

'0.2552 

0.2444 

0.2572 

0.2820 

0.2460 

0.2337 

0.2670 

0.2730 

.238 

3.0 

3 

6 

9 

12 

0.2732 

0.2792 

0.2932 

0.2336 

0.2736 

0.2888 

0.3088 

0.2344 

0.2428 

0.2596' 

0.292/1 

0.2116 

0.2300 

0.2464 

0.2884 

0.2040 
0.2072 
0.2072 
0.226 8 

0.1875 

0.2050 

0.2020 

0.2050 

.19 2 

4.0 

3 

6 

9 

12 

0.2396 

0.2452 

0.2700 

0.2004 

0.2356 

0.2632 

0.2780 

0.1868 

0.2084 

0.2352 

0.2490 

0.1856 

0.1828 

0.2284 

0.2552 

0.1724 

0.1672 

0.1728' 

0.1916 

0.1620 

•0.2048 

0.1970 

0.1560 

.158 

6.0 

3 

6 

9 

12 

0.1872 

0.2276 

0.2352 

0.1472 

0.1652 

0.2156 

0.2260 

0.136 4 
0.1520 
0.1824 
0.1884 

0.1228 

0.1440 

0.1572 

0.1844 

0.1268 ! 0.1102 
0.1080 | 0.1522 

0.1224 j 0.1290 
0.1328 j 0.1040 

.111 

10.0 

3 

6 

9 

12 

0.1238 

0.1216 

0.1540 

0.0804 

0.1120 

0.1420 

0.1752 

0 .0744 
0.0864 
0.1104 
0.1352 

0.0728 

0.0716 

0.1008 

0.1100 

0.0728 j 0.0572 
0.0604 j 0.0770 
0.0644 ! 0.0630 

0.0748 | 0.0600 

.057 

12.0 

3 

6 

9 

12 

0.0868 

0.0884 

0.1368 

0.0556 

0.0916 

0.1196 

0.1580 

0.0560 

0.0664 

0.0844 

0.0968 

0.0460 

0.0580 

0.0656 

0.0856 

0.0440 j 0.0414 
0.0464 1 0.0632 

0.0584 ! 0.0540 

0.0516 ! 0.0540 

.041 

20. Q 

3 

6 

9 

12 

0.0528 

0.0460 

0.0768 

0.0212 

0.0428 

0.0596 

0.0784 

0 .0192 
0.0268 
0 .0332 
0.0308 

0.0188 

0.0188 

0.0300 

0.0420 

0.0156 

0.0196 

0.0176 

0.0156 

0.0124 
| 0.0171 
j 0.0120 
| 0.0140 

.012 

25.0 

1 

0.0272 

0.0336 

0.0588 

0.0140 
0.0 24 8 
0.0436 
0.0508 

0.0140 
0.0132 
0.0232 
0.024-1 • 

0.0112 
0.0104 
0.0152 
0.0256 . 

0.0072 
0 .0064 
0.0076 
0.0100 

0 .005.4 
0.0060 
1 0.0040 

| 0.0060 

.006 






























































Figure I 


Probability of misclassif ication versus a when p 
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Figure II 

Probability of misclassif ication versus a when p = 6. 
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Figure IV 

Probability of misclassif ication versus a when p = 12. 
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Figure V 

Probability of misclassif ication versus n when a = 1.0. 
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Figure VI 

Probability of misclassif ication versus n when a = 6. 
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Figure VII 

Probability of misclassif ication versus n when a = 20. 
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ABSTRACT 

The discriminate analysis problem is discussed briefly. 

An analytic formulation of the so-called Eppler (Table Look- 
up) algorithm is given along with a modification which equates 
the algorithm with the classical Bayes procedure. Simulation 
results comparing several discriminate analysis techniques are 
given. 
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ON THE TABLE LOOK-UP .IN DISCRIMINATE ANALYSIS 

P. L. Odell, B. S. Duran, and W. A. Coberly 
Texas Tech University 


1. Introduction 

Consider m populations tt^, an< ^ suppose each indivi- 

dual in the union of these populations possesses p common observ- 
able characteristics c, , c 0 ,...,c . The observed values of an 

1 2 p 

T 

individual are denoted by x = (x. , x 0 ,...,x) , where x. denotes 
the observed value of c ^ . Let p^(x), p 2 (x) , . . . ,p m (x.) denote m 
known multivariate probability density functions of the p-dimensional 
observation vector x and q^, q 2 , . . . be the known a priori proba- 
bilities that an individual, I, be selected from a population 
ir 2# ... ,ir , respectively. 

The classical discriminate analysis problem consists of formu- 
lating or developing a technique for assigning an individual selected 

m 

at random from U n . into one of the m populations. There have 

i=l 1 

been various techniques proposed for solving the problem, of v;‘. .ich 
the Bayesian solution is optimal, in the sense that it. minimizes 
the expected cost of misclassif ication . 

In various applications of discriminate analysis, for example 
in the analysis of remote sensing data [1] , the amount of computation 
involved is immense. Thus it seems desirable to either develop 
new techniques, modify existing ones, or to decrease the dimensions 
of the problem with the hope of maintaining approximately the 
optimality of the classical Bayes procedure. The dimensions of 

the problem can be decreased by means of characteristic selection 

/ 

[8] and/or data compression [12] techniques. These techniques allow 
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for reducing the value of p. Since the number of populations, m, 
is not arbitrary the only parameter which can be reduced is p, the 
number of characteristics. 

The characteristic selection procedure calls for selecting 

from the set of p characteristics q, q _< p, characteristics, say 

c. , c. ,...,c. , which are "best" with respect to identifying 
1 1 X 2 X q 

individuals from the populations ir^ / * • • / 7I m * It is important 

to note that complete enumeration of all possible choices of 
characteristics is practically impossible since the number of ways 
one can select q characteristics for 1 <_ q <_ p is 

2 p - 1 = ? Ei 

q ti 3 * 1 

If one lets the compression matrix B be a q * p matrix with q ones 

in positions (i . , i ^) , j = 1, 2,...,q, then Y = Bx is simply the 

T ... 

vector T = (x. . x. . ...,x. ) . Thus characteristic selection is 

X 1 x 2 x q 

a special case of data compression. 

Wilks [12] di'scusses a special type of data compression whereby 
k + 1 p-dimensional samples are projected into a q-dimensional space 
q < p, in such a manner that the k+1 projected samples are reason- 
ably well separated. The projection is actually carried out so 
that the pooled-sample' scatter is as large as possible relative 
to the within (total) - sample scatter. For example, one might 
desire to project three 3-dimensional sample points into a 2- 
dimensional or a 1-dimensional space. This is actually done in 
section 6 where various discrimination procedures are evaluated 
by Monte Carlo simulation. 
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A general approach in solving the discriminate problem is to 

define a distance between two populations, say D(i, j; c) where 

c= (c. , c. ,...,c. ), and then select a c such that the minimal 
X 1 x 2 1 q 

distance between any two populations and is maximized. One 
such distance function is divergence [6], [9] defined by 


D (i , j ) = / [p i (x) - p . (x) ] In [p^ (x) / p . (x) ] dx 

— CO 


which is of course an arbitrary choice for a distance function [5] , 
but is being used in at lease one large computer program for 
reducing remote sensing data [7]. It is not well known [2], [3] 

just how distance or divergence is related to misclassification , 
except in the case when the covariance matrices are equal. However, 
one is compelled intuitively to believe that the expected cost of 
misclassif ication should decrease with increasing pairwise distances 
between populations. 

Another technique, although developed heuristically [4], has 
proved successful in reducing the amount of computation involved 
in the solution of the discriminate problem. This technique, called 
the table look-up technique, is an approximation to the Bayes 
partition solution. In this paper we show that the table look-up 
technique can be modified so that it is a "closer" approximation 
to the Bayes procedure. The table look-up technique "trades off" 
floating point addition and multiplication for integer or fixed 
point addition in a table look up computer operations, thereby 
reducing the computing time from 2 units to 0.066 units in at least 
one empirical example [4], 
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In the classical Bayes procedure the probability density 
function p^(x) has to be evaluated for each observation vector x. 
The procedure discussed in this paper eliminates the need for 
computing p^ (x) for each observation vector x. The table look-up 
technique also utilizes a different set of characteristics from 
the p characteristics, for testing the membership of an individual 
in different populations. This concept was developed by Eppler, 
Helmke, and Evans [4] and they have shown empirically that their 
version in the form of a computing . algorithm leads to a significant 
decrease in computer time. 

We now consider the analytic development of the table look-up 
technique. 

2. Analytic^ Development of the Table Look-up Technique 

Let q be chosen a priori and p be known. Let c = (c^, c 

T 

c ) denote the q characteristics selected from the larger set of 
S[ 

p characteristics which maximize the minimal distance between each 
pair of populations and i = 1, 2,... / m; j = 1, 2 / ...,m, 

i ^ j, with respect to an a priori chosen distance function D(i,j). 
Let x^ denote the scalar measurement made on the characteristic 
such that 


a . 


i - x i - a i + n i d ' 1 = 1/ 2 '---' c 2' 


where a^ is known and x^ can take on only those values a^ + jd, 
j = 0, 1, 2,...,n.. For this choice of x. r s the measurement space 

1 q 1 . 

will contain H (n^ + 1) points. The measurement space S^, 
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which is a set of lattice points consisting of all possible measure- 
ment points can be written as 


(1) S g = (a^ , a 1 + n 1 d) 0 (a 2 , a 2 + n 2 d) 0 ••• 0 (a g , a g + n g d) 


In the case considered by Eppler, Helmke, and Evans, a^ = 0 , d = 1 , 
and n^ = 255 for all i = 1, 2 ,..., 111 , motivated by the units and 
manner in which the x . 1 s were measured. The number of points in 

when a^ = 0 and n^ = 255 for all i, is 256^, a very large number. 
Consider the region 


= {x; p^ <x) = max {p^ (x) } and p^(x) >_ max (T ^ ) } 


where each T ^ is an arbitrarily chosen threshold value. One may 
make a normality assumption and select such that 


( 2 ) 


{(x-m.) T n 1 (x- V i.) < C I I (x) e n,} = 1-a 
x x 1 — a x 


where 1-a is selected much as one would select a confidence coef- 
ficient in determining confidence intervals in statistical estimation 

0 

Statement (2) may be written equivalently as 


-f ^i 1 (x " y i } 


P {■ 


( 2 it ) p/2 i | 1/2 


(27 l ) p/2 H i | 1/2 


_1 r 

2 a | I (x) e 1 ,} = 1-a . 
e i 
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The threshold value is then given by 


T / 

i 


(2V p/2 H i | 1/2 



Since x ~ N(p i , ]\) , then x^ p ) = (x-y i ) T J^Cx-jk) is distributed 

chi-square with p degrees of freedom. The value of C q may simply 

2 

be read from a x table [10] from which the value T^ may be computed. 
* 

Let Rq = {x; p^(x) <_ T = max { T j > , i = 1, 2,...,m} be the region 
of no decision ,. that is, the region in which the information con- 
tained in the measurement vector x gives very little or no dis- 

a 

criminate information. The region Rq is not unlike the no decision 
region in classical statistical sequential testing [10]. If R is 
the p-dimensional space then 


S p = R 0 u R x U 


/N 



Let S(x;R) denote a storing transformation (storing operation) 
defined as follows 

% 

a 

S: x -*■ i if x e R. . 

x 

The table look-up technique is based on pre-storing in fast random 
access core memory the prescribed region R^, as i, for all points 
in Sp. That is, every vector x defined by 

R i = {x; (x-y i ) T JT 1 (x-n i ) 1 c a > 


and whose components x. e {a 


+ d, 


n jd) 


j = 1, 
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is placed in a table with x corresponding to i. The value of i 
is stored in the "x" location. Thus when x is measured, its loca- 
tion is looked up and if a value of i is found then I (x) is clas- 
sified into population tt^. The table look-up technique replaces 
the calculation in the classical discriminate technique with a 
retrieval operation for each observation x. The savings in time is 
then the difference in time to retrieve the population classifica- 
tion and calculation and ordering in the classical technique. 

It is important to note that except for the introduction of 

* 

the region of no decision R^ , the table look-up technique is a new 
(different) computational technique for performing the Bayes 
Algorithm. By selecting T = max(T^) sufficiently small, Rq = £f, 
the empty set, and the table look-up technique is simply a clever 
way to perform the Bayes Algorithm. This last statement is sub- 
stantiated in the next section. 


3. Comparison of the Table Look-up Technique with Bayes Algorithm 
Let the p-dimensional Euclidean space R be partitioned into the 
Bayesian discriminate partition R = (R^ , R 2 ,...,R m ), where R^ is 
defined 

\ " {x; r k 

where 

m 

r i = I (*) C(i| j) 

1 j=l 3 J 


= min (r . } } 
l<i<m 1 


/ 
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/ 


and C ( i | j ) denotes 
as coining from tt^. 
given by 


the cost of classifying an observation from r, ^ 
The probability of proper classification is 


P (i | i; R) = / p . (x) dx. 

R. 

1 . . 

Let R. be a subset of R. , that is R. c R. , i = 1, 2,...m, and define 

1 X X X 

• . m ^ 

. Rft = R - U R. , 

u i=l 1 

to be the no decision region. Then 

P(i|i, R) = / p, (x) d x < / p. (x) dx = P(iji, R) 

R i R i 

A A 

and the probability of not making a decision is P (x e Rq ; R) . Note 

* A A A 

that if R = R, then P (x £ Rq ; R) = 0 since Rq = J2f, the empty set. 

Let S (x; R) be a storing operation such that 

A „ A 

S(x;R) :x-*i if xeR^. 

When an observation x is taken on an individual I (x) one searches 
through the storage for the range of S (x; R) which is equivalent 
to determing the integer i for which x c R^. If x e R^ or equiva- 
lently , if i is stored in the "x" location in storage, then we 
assign I (x) to population tt^. 

Since there exists a continuum of x's in the interval a <_ x b, 
the memory requirements are infinite. However, due to the manner 
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in which the data is taken, x takes on only a finite number of 
vector values. That is, there exists only a finite number of values 
that each x^ in the vector x = (x^, x 2 ,...,x p ) can take on. The 
possible values for each x^ are given by 

x^ = + j d ; j = 0 » 1-/ 2 , . . . ,n^. 

./ 

In the remote sensing application for example, a^ = 0 for i = 1, 2,. 
m, d = 1, and n^a- 255, for all i. Hence, the number of storage 

locations required is 256^. This figure is very large indeed, 
even for small values of p. However, there are ways of reducing 
this number substantially. Comments on this item and other feasible 
and practical aspects of the Table look-up technique are discussed 
in section 7. 

The foregoing results are summarized in the following theorem 
and corollary. 

A 

THEOREM: Let R = (R. , R~,...,R ) be a Bayes partition and R = 

x « lu * 

AAA A 

(Rq, R 1 ,...,R m ) be any other partition such that R^ c R^, i = 1, 2,. 

A 

m. Then P (i|i, R) P (i|i, R) if C (ijj) = C for all i. ^ j. 

A A 

COROLLARY: Let Rq tend to the empty set and tend to R^. Then 

A 

R tends to the Bayes partition R. 

The problem in using the table look-up technique reduces to 
that of selecting the partition R = (Rq , R^,...,R m ) which minimizes 
computer time and storage requirements but yet approximates the 
optimality of the Bayes partition sufficiently closely. Thus the 
problem is to select that partition R which maximizes computer 
efficiency. 
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TABLE LOOK-UP 
USING BEST 4 
CHARACTERISTICS 
OUT OF NINE FOR 
EACH POPULATION 

BAYES TECHNIQUE 
USING FOUR BEST 
CHARACTERISTICS 
OUT OF NINE FOR 
ALL POPULATIONS 

BAYES TECHNIQUE 
USING SIX BEST 
CHARACTERISTICS 
OUT OF NINE FOR 
ALL POPULATIONS 

TIME TO 

CLASSIFY A 222- 
SAMPLE LINE 

0.066 SEC 

2.0 SEC 

4.0 SEC 

ACCURACY 

CORRECT 92.4% 
UNDECIDED 3.2% 
INCORRECT 4.4% 

CORRECT 93.1% 
UNDECIDED 0.7% 
INCORRECT 6.2% 

CORRECT 95.0% 
UNDECIDED 0.0% 
INCORRECT 5.0% 

ARITHMETIC 
OPERATIONS 
REQUIRED BY 
ALGORITHM 

INTEGER 

ADDITION 

FLOATING-POINT 
ADD AND MULTI- 
PLY 

FLOATING-POINT 
ADD AND MULTI- 
PLY 


Table 1. Comparison Between Table Look-up and Bayes Approaches. 
4. Modification of the Table Look-up Procedure. 


There exists at least one competing algorithm to the table 
look-up algorithm. We consider one such algorithm which is sug- 
gested in an attempt to minimize storage requirements but yet re- 
tain the desirable properties of the table look-up technique. Let 
q <_ p denote the number of characteristics to be used in a dis- 
criminate analysis. The following two alternatives are available 
in choosing q. 


(1) Select the q "best" characteristics for the union 
of the m populations and perform a table look-up 
algorithm using measurements on these characteristics 

(2) Select the q "best" characteristics for each popula- 
tion and for each such choice perform a table look- 
up algorithm. 





Eppler, Ilelmke, and Evans [4] have given some computational 
comparisons between the Bayes and Table look-up techniques [see 
Table 1] . They selected the four best characteristics from a set 
of nine using real data from the Purdue experiment and found that 
they were able to decrease computer time by a factor of 32 to 1. 
However, storage requirements apparently remained a problem so they 
introduced a scaling transformation to produce a coarse set of 
lattice points which they called "pointer scale". Arguments are 
given in [4] to assure us that little is lost by modifying the al- 
gorithm to include a coarse lattice. These arguments seem reasonable 


but one should remember that primitive (i.e. R = R) table look-up 

implies relative large storage requirements. 

As an alternative procedure one may consider the following 

modification of the table look-up procedure. Let R = (R, , R„,...,R ) 

12m 

A A A 

be the optimal Bayes partition and R = . (R Q , he an Y table 

look-up partition. Let 


where 


R 0 R 10 U R 20 U • ■ ■ u R ra o 


R io = R o n R i 


1 A 
is the intersection of the l Bayes region R^ and Rq. If R^ is 

A 

such that R^ c: for all i, then 


P (i i; R) > P (i | i ; R) . 


Let us select as R^ the largest p-dimensional rectangle in 
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/ 


r , with planar boundaries parallel to the coordinate planes, which 

A a 

contains as much of R^ as possible, including ' the center of R^ . 

Then for a^ <_ x <_ b^, i = 1, the probability of proper 

classification, is 


P (i|i, R) = / 
a 


b il 


il 


/ P i (x) dx 

d . 

ip 


which one wishes to be such that the approximation error 
(3) P (i | i ; R) - P (i|i; R) = e.^ 


is small. 

Now, since the planar sides (bounds) of R^ are parallel to 
the coordinate planes of the p-dimensional space one needs only to 
find those bounds such that if x e R^ then I (x) is assigned to ir^. 

Let the bounds of R^ be given as a a^ <_ x <_ b^ for i = 1, 2,..., 
m. Then a modification of the table look-up procedure which uses 
the rectangles R^, R 2 /...R as an approximation to the Bayes Partition 
may be summarized in the following algorithm. 

Step 1. If the observation vector is such that a^ _< x <_ b^, 
then I (x) is assigned to JK. 

Step 2. If x < a. or x > b. then replace i with i ' ^ i and 
go to Step 1. 

Step 3. Repeat the algorithm m times and if x / R^ for i = 1, 

2,...,m then assign I (x) to Rq , the no decision region. 

This modification of the table look-up algorithm can also be 
employed by using the two alternatives in choosing the q "best" 
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characteristics from the p characteristics. 

If the errors e^ in (3) are not sufficiently small, then the 
approximation could be improved by choosing two disjoint rectangles 

^ A 

R . , = R. and R._ such that R.,U R. 0 = R. contains more of R. 

il i x2 il i2 i x 

than did and 


R i = R iO U R il U R 12 ’ 


An algorithm similar to the one above would hold for the case 
of two rectangles R.^ and R^* In f act ' a union of p-dimensional 
rectangles could be used as an approximation to R^ and the appro- 
priate algorithm could be formulated. However, the amount of in- 
crease in classification accuracy might be so small as to not 
warrant such a venture. 

The algorithm above places an individual I (x) in the no decision 
region if x / R. for i = 1, 2,...,m. The results of Table 1 in- 
dicate that the table look-up places 3.2% of the cases in the no 
decision region for that particular example. For any observation 
falling in the no decision region the Bayes procedure could be 
used. Thus all individuals would be classified and the procedure . 
involved would be as optimal as the Bayes procedure and the com- 
puter time involved would be less than the time required for the 
Bayes procedure alone.. 

A last item to note about the modification for the table look- 
up technique is that in using the p-dimensional rectangle approach, 
the need to store values of i for each x is eliminated and replaced 
with an ordering procedure. That is if a^ < x < then x e R^ c R^ 
and I (x) is assigned to 
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Evaluation of Various Discriminate Techniques 


In the application of discriminate analysis to large sets of 
data, such as in the remote sensing application, it is extremely 
important that one is able to select the "best" available procedure. 
The ideal situation would be to have an optimal procedure that 
can be performed in the least amount of time. However, this is 
never the case. 

A methodology for ranking existing discriminate techniques is 
lacking; however, we will attempt to rank several techniques which 
have been mentioned and/or discussed in the previous sections. The 
suggested rankings involving these techniques will be obtained 
merely by how one would expect them to perform on the basis of the 
way they are defined. For example, a table look-up procedure takes 
less time to perform than a Bayes procedure; however, the Bayes 
procedure is more accurate. In the next section several of these 
techniques will be examined by means of Monte Carlo simulation. 

One can then check to see how those results bear out some of the 
results in this section. 

The evaluations in this section are in reference to (1) 
accuracy, (2) computing speed, and (3) storage requirements. The 
discriminate techniques that will be considered are: 

T^ : A Bayes algorithm using data compressed by means of 

Wilks concepts [12] 

T 2 : A Table look-up technique using the same p characteristics 

for all populations tt^. 

T^: A Table look-up technique using the best q (q < p) 

characteristics for all populations. See (1) of section 4 
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: A Table look-up using the best q (q < p) characteristics 

for each population tt^. See (2) of section 4. 

T c : A p-dimensional rectangle approximation using the same 

D 

p characteristics for all populations. 

T^: A q-dimensional rectangular approximation using the best 

q characteristics for all populations. 

T^: A q-dimensional rectangular approximation using the best 

q characteristics for each population tt^. 

'Tg(j): Let j = 2 ,...,7 and Tg(j) denotes the T. algorithm 

with the modification that a classical Bayes procedure 

A 

is performed if x e Rq. 

T q : Classical Bayes algorithm in which p.(x), i = 1, 2,...,m 

y i 

are known. ... . 

T^q: Classical Bayes algorithm in which p^(x), i = 1, 2 / ... / m 

are assumed normal with unknown parameters y^ and . 

T ll : Classical Bayes algorithm in which p^(x), i = 1, 2 / ... / m 

are unknown and must be estimated "nonparametrically " . 

There are other techniques but we will restrict ourselves to 
these. The Bayes ^technique using the best q (q < p) characteristics 
for all populations is not included, however, it is compared with 
the Table look-up in Table 1. The Bayes solution is given by 
Tg, Tg(2), and Tg(5) when the probability density functions p^(x), 
i=l,2,...,m are known. Hence, it is meaningless to ask which is 
the more accurate. However, the difference in characteristic selec- 
tion implies that T g (2) >_ Tg(3) * T Q (4) - T 8 ^ • T 8 ^ - T 8^ * and 
Tg(5) >_ Tg(6) where the symbol means "is as accurate as". If 

** A 

one can determine the fact that then one can say that the 

table look-up technique is as accurate as the p-dimensio.nal rec- 
tangular approximation technique. In this case Tg(2) > Tg ( 5 ) > Tg (6 ) 
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and Tg (4 ) >^ Tg (7) . 

Other orderings with respect to accuracy are: 

(a) Tg (4) i T 4 , 

(b) t 2 >_ t 3 , 

(c) T{7) >_ T ? , 

(d) t 5 > t 6 , 

(e) T ? >_T e , 

(f) T 8 (6) > T fi# 

(g) T g (5) > T 5 , 

(h) Tg(3) >T 3 , 

(i) Tg (2) >_ T 2 , 

(j) T g > T 1q > T 11 when p.^ (x) , i = 1, 2,...,m are known and noriral 

(k) T 1q >_ T^ when p^(x), i = 1, 2,...,m are normal with 
unknown parameters , 

(l) T 1q and T^ 1 cannot be compared when p^ (x) , i = 1, 2,...,m 
are not known since the accuracy will depend on how far 
the (x ) , i = 1, 2,...,m are from being normal. 

(m) T^ is uniformly less accurate when compared to every other 
member of the list. 

Following are several orderings with respect to computing speed . 

In this case " < > w means "takes no more time than". 

(a) T 2 > T 3 T 4 >_ Tg (4) > T x > {T 9 , T 1q/ T 11 } / 

(b) Tg (2 ) > Tg (3) >_ Tg(4) >_ T 1 > ' {Tg , T 1Q# T^}, 

(c) T 3 > Tg(3) > Tg(4) > T 1 > {Tg, T 10 , T 11 } / 

(d) T 5 > Tg ^ T ? > Tg(7) >.T X >_ {Tg, T 1q , T^}, 

(e) Tg (5) >_ Tg (6 ) >_ Tg (7) >_ > {T g/ T^}, 

(f) Tg > T g ( 6 ) , 
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(g) whether T 2 T^. depends on the speed required to "compare" 
with the speed required to "look up". 

A preliminary analysis was performed by Data Processing Branch 
and Systems Engineering Branch personnel, National Aeronautics and 
Space Administration, Manned Spacecraft Center in conjunction with 
Control Data Corporation, to estimate the Computer Processing Unit 
time required to perform pattern recognition using the Purdue 
University LARS (Laboratory for Applications of Remote Sensing) 
classification technique. The LARS technique assumes the data from 
each class is drawn from a multivariate normal population and the 
classification is done according to the maximum likelihood rule. 

That is, an observation x is classified in population k if 

A A 

p, (x) = max (p.(x)}, k = 1, 2,...,m, where the multivariate density 

K • X 

X 

functions p^ (x) , i = 1, 2,...,m are evaluated using estimates of the 
mean vector and covariance matrix for each class. Thus the LAIS 
technique is the classical Bayes technique T 1Q with equal priors, 
that is, q^ = 1/m, i = 1, 2,...,m. The results of the analysis 
are - summarized in Figure 1. The graph in Figure 1 presents the 
CYBER 73-14 time required to classify 16*10^ picture elements in 
a remote sensing data situation. The graph gives the time required 
to classify elements given the member of classes (populations) to 
be separated and the number of channels (variates) to be used in 
the classification process. 

In a remote sensing data situation the data is obtained in 
the form of an image (or scene) which is a rectangular region con- 
sisting of r rows (scan lines) and c columns (number of resolution 
elements or cells per scan line). Each cell generates a p-variate 
observation. Thus to recognize a scene one must perform rc 
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discriminate tasks, i.e. one must classify rc observations. 

The graph in Figure 1 gives the time required to classify 

6 3 

16*10 observations (elements), where, for example, r = 4'10 

3 

and c= 4 '10 . Data of this magnitude is quite common m a remote 
sensing data situation. 

6 . A Monte Carlo Evaluation . 

The techniques evaluated by Monte Carlo Techniques are: 

1. (T^) A Bayes algorithm using Wilks concepts [12] . 

2. (T 2 ) A table look-up technique. 

3. (T^) A table look-up technique. 

4. (T,-) A p-dimensional rectangular technique. 

5. (Tg) A q-dimensional rectangular technique. 

6. ( T g) The classical Bayes technique assuming known 

parameters . 

7. (T 10 ) The classical Bayes technique using estimated 

parameters. 

The seven techniques listed above have been taken from the 
list in section 5. For the Monte Carlo simulation we took m = 3, 
p = 3, and n = 100 samples from each population were generated 
using the multivariate normal random generator in [11] . Three 
separate trials were conducted corresponding to three different 
sets of multivariate normal populations. Following are the 
vector means and covariance 
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matrices for each trial, 
Trial I: 


and 


Trial II 


V 1 = (150, 200, 100) T , 

T 


y 2 = 

(100, 

150, 

200) 

■E 

W 

II 

(200, 

100, 

150) 

II - 

II 

CN 
C — J 

II 

: — j 

625 I 

= 

(150, 

200, 

100) 

■c 

II 

(100, 

150, 

200)' 

y 3 = 

(200, 

100, 

150)' 


and 


Ii-I: 


l 


I 


\ 


\ 


625 

375 

0 

375 

62 5 

375 

0 

375 

625 

625 

-375 

0 

375 

625 

-375 

0 

-375 

625 


\ 


\ 


l 


Trial III: 


M 1 = (125, 150, 175) T , 
U 2 = (150, 175, 125) T , 
V 3 = (175, 125, 150) T 


/ 
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J 

1 400 

-240 

-200 

t'-'J 

H 

II 

-240 

400 

360 

1 

^-200 

360 

400 

1 

1 400 

240 

-200 

to 

II 

240 

400 

-360 

1 

^-200 

-360 

400 


f 400 

-240 

200 

la" 

-240 

400 

-360 


l 200 

-360 

400 


The results of the discriminate analysis results are giver, in 
Table 2, 3, and 4. There were a total of 24 discriminate analyses 
performed. The techniques labeled 1 and la in the tables denote 
Bayes procedures with the data compressed by Wilks technique from 
3 variates to 2 and 3 variates to 1 , respectively. For technique 
number 5(Tg) the q-dimensional rectangular technique was used for 
the choice of the best 2 variates from the 3 original vafiates. 

Tables 2, 3, and 4 give the number of correct classifications 
in each population, the number of misclassif ications , the number 
not classified, and the amount of time (in .01 sec.) in each analysis 
for each trial, respectively. 

The orderings with regard to accuracy in section 6 are supported 
by the Monte Carlo results for the 7 techniques used. 
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However, all the. orderings with regard to computer time are 
not, strictly speaking, supported. For example, section 6 has 
as taking no more- computer time than (T^ >. T^) . However, for all 
three trials took less time than , although the difference was 
very small, and the computer ordering with respect to time could 
be due to certain factors such as language used (Fortran), pro- 
gramming procedures, and so on. 


Table 2. Trial I. 


Technique 



1 

la 

2 

3 

4 

5 

6 

7 

Classified in 

99 

100 

* 97 

97 

79 

91 

99 

100 

Classified in tt 2 

100 

98 

97 

96 

80 

91 

99 

99 

Classified in 

98 

99 

98 

96 

83 

94 

100 

100 

Misclassif ied 

3 

3 

2 

5 

0 

2 

2 

1 

Not classified 

* 0 

0 

6 

6 

58 

22 

0 

0 

Time in .01 sec.* 

542 

508 

288 

278 

287 

278 

750 

718 


* Includes Input-Output time of 2 seconds. 


/ 
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Table 3. Trial II. 
Technique 



Hi 

m 

2 

3 

4 

5 

6 

7 

Classified in 

100 

100 

96 

99 

54 

87 

99 

99 

Classified in tt 2 

99 

98 

94 

98 

56 

86 

100 

100 

Classified in 

100 ‘ 

70 

90 

96 

81 

96 

98 

98 

Misclassif ied 

1 

32 

2 

4 

9 

11 

3 

3 ' 

Not classified 

0 

0 

18 

3 

94 

29 

0 

0 

Time in .01 sec.’ 

f 560 

457 

283 

282 

287 

283 

660 

662 


* Includes Input-Output time of 2 seconds. 


Table 4. Trial III. 


Technique 



H 

Hi 

2 

3 

• 4 

5 

6 ' 

7 

Classified in tt^ 

' 94 

69 

94 

90 

25 

85 

95 

95 

Classified in 

91 

75 

93 

76 

23 

89 

94 

95 

Classified in 

84 

84 

88 

83 

19 

80 

89 

90 

Misclassif ied 

31 

72 

12 

39 

1 

34 

22 

20 

Not classified 

0 

0 

13 

11 . 

232 

12 

' 0 

0 

Time in .01 sec.* 

578 

463 

287 

280 

9 0 9 

302 

650 

66 8 


* Includes Input-Output time of 2 seconds. 
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7. Feasibility and Practical Aspects of Table Look-Up . 

There are various instances when the Table look-up ceases 
to be a practical technique. One such instance is when the num- 
ber of values to be computed for the table exceeds the number of 
values to be classified. Three ways in which to reduce memory 
requirements for the Table look-up technique are 

(1) reduce the number of variates p, 

(2) store only regions of interest in the measurement 
space, and 

(3) compress the regions of interest. 

Item (3) involves storing the classification for several conti- 
guous locations all in a single core memory location. This is the 
transformation discussed in [4] which produces a coarse set of 
lattice points and is called "pointer scale". In [4] it is seen 

how storage requirements for a table for one population ?'*e re- 
2 

duced from 256 = 65,536 to 864 and from 864 to 144 by successively 

using (1) , (2) , and (3) above. 

In our simulation study the table for each population consisted 
of a p-dimensional lattice cube having 12 points to a side. This 
called for 12 p = 12^ = 1723 storage locations. The total storage 
requirements in our case were then m(12) p = 3(12) J = 5184 locations. 
In using the best 2 of 3 characteristics for each population the 
storage requirements were m(12)^ = 3(12)^ = 432 locations. Each 
of the lattice points was classified by the Bayes procedure prior 
to classifying the 300 observations resulting in 5184 classifications 
(p = 3) or 432 classifications (q = 2). In these cases the number' 
of classifications necessary to construct the table . exceeds the 
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number (300) of further classifications. This was done to get 
information regarding accuracy, speed, and storage requirements. 

In practice one could be faced with classifying data of the magni- 
tude of 10^ observations, such as in remote sensing, in which 
case the Table look-up technique would be quite practical. 

In summary ; there are cases when the Table look-up would prove 
quite useful. 

8. Concluding Remarks 

From the evaluation in section 5 and the simulation results 
it appears that the table look-up technique has much to recommend 
it, especially if all p variates are used and if all observations 
x falling in the no decision region are classified according to 
the classical Bayes procedure. The procedure Tg(2) had 6, 18, and 
13 observations for trials I, II, and III, respectively, falling 
in the no decision region. Procedure Tg(5) had 22, 29, and. 12 
observations falling in the no decision region. All these observa- 
tions could have been classified according to the classical Bayes 
procedure and would have made the results as accurate. 

If the number of observations fallinc, in the no decision region 
is large, say 30% or more, then Tg(2) and Tg(5) would take consider- 
ably more time than T ^ and T^. It would be useful in cases like 
these to use the p-dimensional rectangular approach where two or 
more rectangles in each R^ are utilized. 
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